On Wed, Nov 16, 2016 at 07:03:31PM +0100, Aleksandar Pavic wrote:
> I am interested for adding Serbian language as a language for stemming.
To incorporate a stemmer in Xapian, it needs to be:
* suitably licensed - MIT/X or BSD (2 or 3 clause) or similar, so that
incorporating it doesn't block relicensing
* written in either Snowball (https://snowballstem.org/) or C/C++
* accompanied by a vocabulary list with matching stems, and that also needs
to be suitably licensed (though we can probably allow GPL-ed wordlists).
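It's good if we can be confident that the algorithm works well, as changing
it later makes searches incompatible with existing databases (terms indexed
with the old stems won't match queries stemmed with the new ones). For
example, a stemmer based on a peer-reviewed paper which includes details of
its evaluation would give us that confidence.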
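
To see why a later change is a problem, here is a toy sketch: documents are
indexed under the stems produced by one version of a stemmer, then queried
using a changed version. `stem_v1`, `stem_v2` and the dict-based index are
all hypothetical stand-ins, not how Xapian actually stores terms.

```python
from collections import defaultdict


def stem_v1(word: str) -> str:
    """Hypothetical original algorithm: strips a trailing 's' only."""
    return word[:-1] if word.endswith("s") else word


def stem_v2(word: str) -> str:
    """Hypothetical revised algorithm: also strips a trailing 'es'."""
    if word.endswith("es"):
        return word[:-2]
    return word[:-1] if word.endswith("s") else word


def build_index(docs, stem):
    """Map each stem to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index


docs = {1: "boxes of apples", 2: "a single box"}
index = build_index(docs, stem_v1)  # database built with the old stemmer

# Querying with the same stemmer finds the document containing "boxes"...
hits_old = index.get(stem_v1("boxes"), set())
# ...but the revised stemmer yields a term that document 1 was never
# indexed under, so that document silently drops out of the results.
hits_new = index.get(stem_v2("boxes"), set())
```

In this toy case the only fix is to rebuild the index with the new stemmer,
which is the cost being avoided by getting the algorithm right up front.
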
It would also be good to have references to any papers, etc. the stemmer
aims to implement, along with any intentional points of deviation and the
reasons for them (otherwise someone will later report that the stemmer
doesn't follow the paper and it'll be hard to know if that's intended).
> I'm interested what kind of development/database is required to do that,
> and I can maybe include some people from university, etc...
There are some existing stemmers for Serbian, e.g. this one in Python:
https://github.com/nikolamilosevic86/SerbianStemmer
I don't see an explicit licence stated there, but you could ask the author
to actually specify one if it seems a suitable starting point.
Cheers,
Olly