Hello Xapian devs,

I was going through the bite-size projects and found the Krovetz English stemmer, and I would really like to work on it. But I have a few doubts.

Although implementing the Krovetz stemmer isn't very hard, Xapian's stemmers are written in Snowball, and the Krovetz stemmer doesn't seem to be readily implementable in Snowball. Also, although this is a dictionary-based stemmer, the original paper doesn't give any pointers on how to create the dictionary. I think this can be overcome by looking at available implementations of the stemmer.

As of now, I am planning to start writing Snowball code, beginning with the plural forms of words. I would appreciate a few pointers on how to make this project a success.

Also, sorry for the last mail; it was sent by mistake.

Thanks.
On 10 Feb 2015, at 17:35, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:

> I was going through the bite-size projects and found the Krovetz English stemmer and I would really like to work on it.
> But I have a few doubts.
> Though implementation of the Krovetz stemmer isn't very hard, Xapian stemmers are made with Snowball.
> But the Krovetz stemmer doesn't seem to be readily implementable with Snowball.

Xapian has an abstraction layer which would allow you to implement the Krovetz stemmer alongside the Snowball stemmers.

> Also though this is a dictionary based stemmer, the original paper doesn't give us pointers on how to create the dictionary. Though I think this can be overcome by looking at available implementations of the stemmer.

The dictionary should probably be configurable, so we aren't forcing people to use our recoding rules. To an extent, that also means we don't have to worry immediately about having a good recoding list, so you can focus on building the stemmer itself first.

> As of now, I am planning to start on writing Snowball code by starting with plural forms of words.

Because of the structure of the Krovetz algorithm, I'm not sure that using Snowball for the individual transform steps is going to be particularly easy. You might be better off doing a straight implementation, from the paper, of the entire algorithm (in C or C++). (I could be wrong, but it feels sufficiently small that integrating one or more Snowball stemmers in the middle of the algorithm might be more confusing than writing the whole thing in C.)

J

-- 
James Aylett, occasional trouble-maker
xapian.org
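As a rough illustration of the "straight implementation" with a configurable dictionary discussed above, here is a minimal C++ sketch of a single dictionary-checked plural step. It is a simplification in the spirit of the Krovetz approach, not the full algorithm from the paper: the class name `PluralStepSketch`, the exact rule order, and the sample dictionary entries are all assumptions made for illustration, and nothing here is existing Xapian code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical class, for illustration only; not part of Xapian.
// The dictionary is supplied by the caller, so it stays configurable.
class PluralStepSketch {
  public:
    explicit PluralStepSketch(std::unordered_set<std::string> dictionary)
        : dict_(std::move(dictionary)) {}

    // Apply one simplified plural-removal step to a lower-case word.
    std::string operator()(const std::string& word) const {
        // Very short words and words already in the dictionary are kept.
        if (word.size() < 3 || in_dict(word)) return word;

        if (ends_with(word, "ies")) {
            // Prefer "-ie" when the dictionary knows it, else use "-y".
            std::string ie = strip(word, 1);
            if (in_dict(ie)) return ie;
            return strip(word, 3) + "y";
        }
        if (ends_with(word, "es")) {
            // Prefer dropping just "s" when the result is a known word.
            std::string e = strip(word, 1);
            if (in_dict(e)) return e;
            return strip(word, 2);
        }
        if (ends_with(word, "s") && !ends_with(word, "ss") &&
            !ends_with(word, "us")) {
            return strip(word, 1);
        }
        return word;
    }

  private:
    bool in_dict(const std::string& w) const { return dict_.count(w) != 0; }

    static bool ends_with(const std::string& w, const std::string& suffix) {
        return w.size() >= suffix.size() &&
               w.compare(w.size() - suffix.size(), suffix.size(), suffix) == 0;
    }

    // Return the word with its last n characters removed.
    static std::string strip(const std::string& w, std::size_t n) {
        assert(n <= w.size());
        return w.substr(0, w.size() - n);
    }

    std::unordered_set<std::string> dict_;
};
```

With a dictionary containing only "house", "calorie" and "news", this sketch maps "houses" to "house" and "calories" to "calorie" through the dictionary check, falls back to the suffix rules for "boxes" and "ponies", and leaves "news" and "glass" unchanged. To plug something like this into Xapian, the functor would presumably be wrapped in a subclass of the abstraction layer mentioned above (which I believe is `Xapian::StemImplementation`); that wiring is not shown here.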
Richhiey, can you try to keep replies on the mailing list, please? That way everyone can help and benefit.

On 10 Feb 2015, at 18:11, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:

> Yes, I had for a start written an implementation of a small part of the algorithm in C++ and thought of implementing it with Snowball.
> But I guess doing a straight implementation of the algorithm in C++ would be easier for a start!
> Also, a configurable dictionary is a good idea. Will keep that in mind. :)
> Thanks for the pointers. Will update when I am done with a C++ implementation of the stemmer.

Great. Let us know if you run into any problems :)

J

-- 
James Aylett, occasional trouble-maker
xapian.org