Eugene!
2010-Feb-14 15:06 UTC
[Xapian-devel] let's discuss http://trac.xapian.org/ticket/448 "Allow usage of custom stemmers"
Hello Xapian developers!

First of all, I'd like to thank you for the Xapian project. Great work! Xapian has decent performance and is very easy to get started with.

However, it is missing some features I really need, most notably dictionary-based stemming and spelling that can be used from Python code. The current Xapian::Stem code does not provide such functionality even at the C++ level, let alone through the SWIG bindings. That is a no-go for me, so I have to find some way to deal with it.

I'm not a big fan of SWIG and have very little knowledge of it, so I decided not to dig deep into the current Xapian SWIG bindings and instead to concentrate on developing a standalone extension as a prototype or proof of concept for my work.

The main idea I got after looking into the Xapian C++ code for Xapian::Stem is that it is mostly ready for custom stemming engines right now! All that is required is to give it a vtable. The easiest way to do that (and the best way for C++ subclassing) is to make the Xapian::Stem::~Stem() destructor virtual.

In my first attempt I also tried to make Stem::operator() virtual, but that DIDN'T work for me because of C++ type conversion (e.g. for TermGenerator):

{{{
class XAPIAN_VISIBILITY_DEFAULT TermGenerator {
  public:
    ...
    /// Set the Xapian::Stem object to be used for generating stemmed terms.
    void set_stemmer(const Xapian::Stem & stemmer);
    ...
};

class MyStem : public Xapian::Stem {
    ...
    // We've patched Xapian::Stem to have
    // virtual std::string operator()(const std::string &word) const;
    std::string operator()(const std::string &word) const;
};

MyStem stem("english");
Xapian::TermGenerator tg;

// C++ type conversion issue: stem will be treated as a plain Xapian::Stem
// and our overridden operator() won't be used.
tg.set_stemmer(stem);
}}}

Then I noticed the Xapian::Stem::Internal reference-counted pointer, and that led me to the working solution.
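The failed first attempt can be reproduced without Xapian. The sketch below assumes the term generator keeps its own copy of the stemmer by value, which would explain the behaviour described above. Stem, MyStem and TermGenerator here are hypothetical stand-ins, not the real Xapian classes.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a Stem patched to have a virtual operator().
class Stem {
  public:
    Stem() = default;
    Stem(const Stem&) = default;
    Stem& operator=(const Stem&) = default;
    virtual ~Stem() = default;
    virtual std::string operator()(const std::string& word) const {
        return "base:" + word;
    }
};

// Derived stemmer that overrides operator().
class MyStem : public Stem {
  public:
    std::string operator()(const std::string& word) const override {
        return "custom:" + word;
    }
};

// Hypothetical stand-in for a term generator. Assumption: it stores the
// stemmer by value, so assignment copies only the Stem base part.
class TermGenerator {
    Stem stemmer_;
  public:
    void set_stemmer(const Stem& stemmer) { stemmer_ = stemmer; }
    std::string stem(const std::string& word) const { return stemmer_(word); }
};

// Stems a word through a generator that was handed a MyStem.
std::string stem_via_generator(const std::string& word) {
    MyStem my;
    TermGenerator tg;
    tg.set_stemmer(my);  // the MyStem part is sliced away here
    return tg.stem(word);
}

// Stems a word through a base-class reference to a MyStem.
std::string stem_via_reference(const std::string& word) {
    MyStem my;
    const Stem& ref = my;
    return ref(word);  // virtual dispatch still reaches MyStem
}

// Checks both paths; call this from main() to run the sketch.
void demo() {
    assert(stem_via_reference("running") == "custom:running");
    assert(stem_via_generator("running") == "base:running");
}
```

Through a reference the override is reached, but the copy held inside the generator is a plain Stem, so the override is lost. Making operator() virtual is therefore not enough on its own when the object is copied by value.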
What I need is a 2-step thing:
 1. Subclass Xapian::Stem::Internal to use a dictionary-based stemmer (Hunspell in my case), in exactly the same way as is done for the different languages in the current code.
 2. Subclass Xapian::Stem in order to create an instance of my own implementation of Xapian::Stem::Internal.
 3. Profit.

That is all I need, because even when my derived class is treated as a plain Xapian::Stem it is no longer a problem: it will continue using the `internal' member, which does the actual work! Using the derived class instance and reference counting it is supported by the copy constructor and operator=() of Xapian::Stem:

{{{
Stem::Stem(const Stem & o) : internal(o.internal) { }

void
Stem::operator=(const Stem & o)
{
    internal = o.internal;
}
}}}

The last step was to tell SWIG to apply its "director" feature to Xapian::Stem, which now has a vtable. Voilà, my Hunspell stemmer is used by Xapian during indexing and query parsing from my Python code!

To share my research and to get the trivial patches incorporated into Xapian trunk I created ticket trac.xapian.org/ticket/448. However, it was closed without being thoroughly analysed, and I am now trying to convince you to take a second look at it. I've attached to the ticket my minimal patches for Xapian-core and Xapian-bindings, and my research prototype C++/SWIG extension for Python which demonstrates the approach.

I could continue to use the patched Xapian myself, but it would be better for this to go upstream, as it:
 1. doesn't change the current architecture
 2. has no side effects for the current code and usage patterns
 3. is really trivial

Any comments?
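The two-step approach described above can be sketched in isolation as follows. This is only an illustration under stated assumptions: StemInternal, Stem, DictStemInternal, DictStem and TermGenerator are hypothetical stand-ins, std::shared_ptr stands in for Xapian's own reference-counted pointer, and a std::map lookup stands in for a real dictionary engine such as Hunspell.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for Xapian::Stem::Internal.
class StemInternal {
  public:
    virtual ~StemInternal() = default;
    virtual std::string operator()(const std::string& word) = 0;
};

// Hypothetical stand-in for Xapian::Stem: a copyable handle that
// forwards all work to a shared, reference-counted internal object.
class Stem {
  protected:
    std::shared_ptr<StemInternal> internal;
    explicit Stem(std::shared_ptr<StemInternal> impl) : internal(std::move(impl)) {}
  public:
    Stem() = default;
    Stem(const Stem&) = default;
    Stem& operator=(const Stem&) = default;
    virtual ~Stem() = default;
    std::string operator()(const std::string& word) const {
        return internal ? (*internal)(word) : word;
    }
};

// Step 1: a dictionary-based internal (a map instead of Hunspell).
class DictStemInternal : public StemInternal {
    std::map<std::string, std::string> dict_;
  public:
    explicit DictStemInternal(std::map<std::string, std::string> dict)
        : dict_(std::move(dict)) {}
    std::string operator()(const std::string& word) override {
        auto it = dict_.find(word);
        return it == dict_.end() ? word : it->second;  // unknown words pass through
    }
};

// Step 2: a thin Stem subclass whose only job is to install that internal.
class DictStem : public Stem {
  public:
    explicit DictStem(std::map<std::string, std::string> dict)
        : Stem(std::make_shared<DictStemInternal>(std::move(dict))) {}
};

// Stand-in consumer that stores the stemmer by value.
class TermGenerator {
    Stem stemmer_;
  public:
    void set_stemmer(const Stem& stemmer) { stemmer_ = stemmer; }
    std::string stem(const std::string& word) const { return stemmer_(word); }
};

// Stems a word through a generator that was handed a DictStem.
std::string stem_through_generator(const std::string& word) {
    std::map<std::string, std::string> dict = {{"mice", "mouse"}, {"geese", "goose"}};
    DictStem stem(dict);
    TermGenerator tg;
    tg.set_stemmer(stem);  // the copy is a plain Stem sharing the same internal
    return tg.stem(word);
}

// Checks both a dictionary hit and a miss; call this from main() to run the sketch.
void demo() {
    assert(stem_through_generator("mice") == "mouse");
    assert(stem_through_generator("cats") == "cats");
}
```

Because the handle only copies a pointer, the consumer can keep treating it as a plain Stem while the dictionary-backed internal continues to do the work. The design choice is that polymorphism lives in the internal object, not in the handle that gets copied around.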
Kevin Duraj
2010-Feb-14 20:17 UTC
[Xapian-devel] let's discuss http://trac.xapian.org/ticket/448 "Allow usage of custom stemmers"
Eugene,

Python is an extremely slow interpreted programming language and is definitely not suitable for running high-performance search engines. You mention that you need to modify Xapian for your work project, and thus your modification would affect all the millions of Xapian users around the world. You did not mention how much of your salary you want to share with us, but I assume that is nothing. You just want to add some overhead to Xapian and make it slower because you need it for your work.

Interesting, but you definitely made me laugh this Saturday morning.

Kevin Duraj
myhealthcare.com
Olly Betts
2010-Feb-14 23:51 UTC
[Xapian-devel] let's discuss http://trac.xapian.org/ticket/448 "Allow usage of custom stemmers"
On Sun, Feb 14, 2010 at 09:06:38PM +0600, Eugene! wrote:
> However, it has some "missing" features I really need to have, and the
> most noticeable is dictionary-based stemming and spelling available to
> be used from Python code.

Please can we discuss this in a single place. If we split discussion between the list and the ticket, it takes more effort to follow, and will be harder to look back in the future to see why we decided what we did.

Since you've already opened a ticket for this, and discussion has started there, I'd suggest we discuss it there.

Cheers,
Olly