Eugene!
2010-Feb-14 15:06 UTC
[Xapian-devel] let's discuss http://trac.xapian.org/ticket/448 "Allow usage of custom stemmers"
Hello Xapian developers!
First of all, I'd like to thank you guys for the Xapian project.
Great work! Xapian has decent performance and is very easy to get
started with.
However, it has some "missing" features I really need to have, and the
most noticeable is dictionary-based stemming and spelling available to
be used from Python code.
The current code of Xapian::Stem doesn't provide any such
functionality even at the C++ level, let alone in the SWIG bindings.
That is a no-go for me, so I have to find some way to deal with it.
I'm not a big fan of SWIG and have very little knowledge of it, so
I decided not to go deep into the current Xapian SWIG bindings and
to concentrate on developing a standalone extension which would be a
prototype or proof of concept for my work.
The main idea I got after looking into the Xapian C++ code for
Xapian::Stem is that it is mostly ready for using custom stemming
engines right now! All that is required is to give it a vtable!
The easiest way to do that (and the best way for C++ subclassing) is to
make the Xapian::Stem::~Stem() destructor virtual.
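As a rough illustration of what that one-line change buys, here is a minimal sketch. StemBefore, StemAfter and MyStem are hypothetical stand-ins, not the real Xapian classes; the point is only that a virtual destructor makes the class polymorphic and lets a subclass be destroyed correctly through a base pointer.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for a class with no virtual members (no vtable).
class StemBefore {
public:
    ~StemBefore() {}
};

// Hypothetical stand-in after the change: the destructor is virtual.
class StemAfter {
public:
    virtual ~StemAfter() {}
};

static_assert(!std::is_polymorphic<StemBefore>::value, "no vtable yet");
static_assert(std::is_polymorphic<StemAfter>::value, "now polymorphic");

// A subclass that records when its own destructor has run.
class MyStem : public StemAfter {
public:
    explicit MyStem(bool* flag) : destroyed_(flag) {}
    ~MyStem() override { *destroyed_ = true; }
private:
    bool* destroyed_;
};

// Deletes a MyStem through a base-class pointer and reports whether
// the derived destructor was called.
bool derived_destructor_runs() {
    bool destroyed = false;
    StemAfter* s = new MyStem(&destroyed);
    delete s;  // dispatches to ~MyStem() because the destructor is virtual
    return destroyed;
}

void demo() {
    assert(derived_destructor_runs());
}
```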
Also, in my first attempt I tried to make Stem::operator() virtual,
but that DIDN'T work for me because of C++ type casting (e.g. for
TermGenerator):
{{{
class XAPIAN_VISIBILITY_DEFAULT TermGenerator {
  public:
    ...
    /// Set the Xapian::Stem object to be used for generating stemmed terms.
    void set_stemmer(const Xapian::Stem & stemmer);
    ...
};

class MyStem : public Xapian::Stem {
    ...
    // We've patched Xapian::Stem to have:
    //   virtual std::string operator()(const std::string &word) const;
    std::string operator()(const std::string &word) const;
};

MyStem stem("english");
Xapian::TermGenerator tg;

// C++ type cast issue - stem will be treated as Xapian::Stem and our
// overridden operator() won't be used.
tg.set_stemmer(stem);
}}}
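One likely reading of this "type cast issue" is object slicing. The sketch below assumes the receiving class keeps its own by-value copy of the stemmer it is given; that is an assumption about the implementation, not something stated above. BaseStem, CustomStem and Generator are hypothetical stand-ins, not Xapian classes.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a Stem whose operator() has been made virtual.
struct BaseStem {
    virtual ~BaseStem() = default;
    virtual std::string operator()(const std::string& word) const {
        return word;  // default behaviour: leave the word unchanged
    }
};

// Hypothetical subclass that overrides operator().
struct CustomStem : BaseStem {
    std::string operator()(const std::string& word) const override {
        return "custom:" + word;  // marker so we can see which version ran
    }
};

// Hypothetical stand-in for a class that stores the stemmer BY VALUE.
class Generator {
    BaseStem stemmer_;
public:
    // Copies only the BaseStem part of the argument (object slicing).
    void set_stemmer(const BaseStem& s) { stemmer_ = s; }
    std::string stem(const std::string& w) const { return stemmer_(w); }
};

// Calling through a reference keeps the dynamic type, so the override runs.
std::string stem_via_reference(const std::string& word) {
    CustomStem custom;
    const BaseStem& ref = custom;
    return ref(word);
}

// Going through the by-value copy loses the override.
std::string stem_via_generator(const std::string& word) {
    CustomStem custom;
    Generator gen;
    gen.set_stemmer(custom);
    return gen.stem(word);
}

void demo() {
    assert(stem_via_reference("running") == "custom:running");
    assert(stem_via_generator("running") == "running");
}
```

Under that assumption, making operator() virtual cannot help on its own, because the stored copy is always a plain base object.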
Then I noticed the Xapian::Stem::Internal reference-counted pointer,
and that led me to the working solution.
What I need is a 2-step thing:
1. Subclass Xapian::Stem::Internal to use a dictionary-based stemmer
(Hunspell in my case) in exactly the same way as it is done for the
different languages in the current code
2. Subclass Xapian::Stem in order to create an instance of my own
implementation of Xapian::Stem::Internal
3. profit
That is all I need, because even when my derived class is treated as
a plain Xapian::Stem that is no longer a problem: it will continue
using the `internal' attribute, which does the actual work! Using the
derived class instance and refcounting it is supported by the copy
constructor and operator=() of Xapian::Stem:
{{{
Stem::Stem(const Stem & o) : internal(o.internal) { }

void
Stem::operator=(const Stem & o)
{
    internal = o.internal;
}
}}}
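The two steps and the copy behaviour can be sketched together as follows. This is only an illustration under stated assumptions: Stem, SuffixInternal, DictStem and Generator are hypothetical stand-ins, std::shared_ptr takes the place of Xapian's own reference-counted pointer, and the toy suffix rule stands in for a real dictionary engine such as Hunspell.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical handle class: all real work is delegated to Internal.
class Stem {
public:
    class Internal {
    public:
        virtual ~Internal() = default;
        virtual std::string stem(const std::string& word) const = 0;
    };

    Stem() = default;
    virtual ~Stem() = default;

    // Non-virtual: behaviour lives in the shared Internal object.
    std::string operator()(const std::string& word) const {
        return internal ? internal->stem(word) : word;
    }

protected:
    explicit Stem(std::shared_ptr<Internal> i) : internal(std::move(i)) {}

private:
    std::shared_ptr<Internal> internal;  // copied along with the handle
};

// Step 1: a custom Internal (toy rule: drop a trailing "s").
class SuffixInternal : public Stem::Internal {
public:
    std::string stem(const std::string& word) const override {
        if (word.size() > 1 && word.back() == 's')
            return word.substr(0, word.size() - 1);
        return word;
    }
};

// Step 2: a Stem subclass whose only job is to install that Internal.
class DictStem : public Stem {
public:
    DictStem() : Stem(std::make_shared<SuffixInternal>()) {}
};

// Stores a plain Stem by value, so the DictStem part is sliced away,
// but the copied handle still points at the SuffixInternal.
class Generator {
    Stem stemmer_;
public:
    void set_stemmer(const Stem& s) { stemmer_ = s; }
    std::string stem(const std::string& w) const { return stemmer_(w); }
};

std::string stem_via_generator(const std::string& word) {
    DictStem custom;
    Generator gen;
    gen.set_stemmer(custom);
    return gen.stem(word);
}

void demo() {
    assert(stem_via_generator("cats") == "cat");
}
```

The design choice here is that the subclass carries no state of its own; everything that matters lives behind the shared pointer, so copying the base part is enough to preserve the custom behaviour.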
The last step was to point SWIG to apply its "director" feature to
Xapian::Stem, which now has a vtable. Voilà, I have my Hunspell
stemmer being used by Xapian during indexing and query parsing from my
Python code!
To share my research and to get the trivial patches incorporated
into the Xapian trunk I've created the ticket
http://trac.xapian.org/ticket/448. But it has been closed without
being thoroughly analysed, and I am now trying to convince you guys to
take a second look at it. I've attached to the ticket my minimal
patches for Xapian-core and Xapian-bindings and my research prototype
C++/SWIG extension for Python which demonstrates the approach.
I could continue to use the patched Xapian for myself, but it would be
better to go upstream as it:
1. doesn't change the current architecture
2. has no side effects for the current code and usage patterns
3. is really trivial
Any comments?
Kevin Duraj
2010-Feb-14 20:17 UTC
[Xapian-devel] let's discuss http://trac.xapian.org/ticket/448 "Allow usage of custom stemmers"
Eugene,
Python is extremely slow interpreted programming language and is definitely not suitable for running high performance search engines. You mention that you need to modify Xapian for your work project and thus your modification would affect all the millions of Xapian users around the world. You did not mention how much salary you want to share with us but I assume that is nothing. You just want to add some overhead to Xapian and make it slower because you need it for your work. Interesting, but definitely you make me laugh this Saturday morning.
Kevin Duraj
http://myhealthcare.com
Olly Betts
2010-Feb-14 23:51 UTC
[Xapian-devel] let's discuss http://trac.xapian.org/ticket/448 "Allow usage of custom stemmers"
On Sun, Feb 14, 2010 at 09:06:38PM +0600, Eugene! wrote:
> However, it has some "missing" features I really need to have, and the
> most noticeable is dictionary-based stemming and spelling available to
> be used from Python code.
Please can we discuss this in a single place. If we split discussion
between the list and the ticket, it takes more effort to follow, and will
be harder to look back in the future to see why we decided what we did.
Since you've already opened a ticket for this, and discussion has started
there, I'd suggest we discuss it there.
Cheers,
Olly