Hello Xapian devs,

I had shown interest in writing a Krovetz stemmer for Xapian and spoke to James Aylett about it. Since it was hard to code the stemmer in Snowball, I came up with a C++ implementation instead. But since it is a dictionary-based stemmer, I'm having trouble deciding how to create the dictionary. I checked some of the Krovetz stemmer implementations available online, but all of them ship large dictionaries, and I'm not sure that would be helpful in our case, since the dictionary would serve users better if it were configurable.

I believe words such as exceptions and nationalities have to be treated differently, and I have implemented that by creating a DictEntry class with a boolean member named exception.

Any advice on how to proceed would be of much help :)
On 15 Feb 2015, at 07:04, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:

> Since [Krovetz] is a dictionary-based stemmer, I'm having problems deciding how to create the dictionary.

Richhiey, I think I recommended that you load any dictionaries you need from a file, which could be specified when constructing the stemmer. That separates the need to create the dictionary from implementing the feature, although we'll have to provide some initial dictionary eventually.

How you then structure that in your code as you load it from file and later use it is entirely up to you. If it's just a list of words that should be treated specially, having a class to represent each word feels like overkill: you can probably do it with something like an STL container of a basic_string of some sort (std::wstring? I haven't done much Unicode work in C++, so others may want to jump in and correct me here).

J

-- 
James Aylett, occasional trouble-maker
xapian.org
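A minimal sketch of the approach described above: a word list read from a file named at construction time and held in an STL container of strings. The class name WordList, the one-word-per-line format, and the '#' comment convention are assumptions made for illustration; none of them come from Xapian or from any existing Krovetz implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_set>

// Hypothetical helper, not part of the Xapian API.
// Holds the words that the stemmer should treat specially.
class WordList {
  public:
    // Load from an already-open stream (handy for in-memory data).
    explicit WordList(std::istream& in) { load(in); }

    // Load from the file named when the stemmer is constructed.
    explicit WordList(const std::string& path) {
        std::ifstream in(path);
        if (!in) throw std::runtime_error("cannot open dictionary: " + path);
        load(in);
    }

    bool contains(const std::string& word) const {
        return words_.count(word) != 0;
    }

    std::size_t size() const { return words_.size(); }

  private:
    // Assumed format: one UTF-8 word per line; blank lines and lines
    // starting with '#' are ignored.
    void load(std::istream& in) {
        std::string line;
        while (std::getline(in, line)) {
            // Tolerate Windows line endings.
            if (!line.empty() && line.back() == '\r') line.pop_back();
            if (line.empty() || line[0] == '#') continue;
            words_.insert(line);
        }
    }

    std::unordered_set<std::string> words_;
};
```

Taking a std::istream& in one constructor keeps the parsing independent of where the bytes come from, so the same code can be exercised with a std::istringstream and no file on disk.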
Hello,

Yes James, I will load whatever dictionary the program requires. I have already structured the program accordingly, so it shouldn't be a problem :)

Also, I did not mean a different class for every exception. I'm using an unordered_map to map each word to its dictionary entry, for which I have created a DictEntry class. The entry stores the word along with a flag indicating whether it is an exception. That should work well, right?

Also, once I'm done, how can I have one of you review the code and help me proceed?

Thanks.
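A rough sketch of the unordered_map layout described in this message. Only the DictEntry name and its exception flag come from the thread; the root field, the lookup function, and the rule that a non-exception dictionary word is returned unchanged are assumptions for illustration, not a statement of how the actual implementation behaves.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// One dictionary entry. The word itself is the map key, so it does not
// need to be stored again inside the entry.
struct DictEntry {
    bool exception;    // true: the word bypasses the normal suffix rules
    std::string root;  // assumed: form to return for an exception word
};

using Dictionary = std::unordered_map<std::string, DictEntry>;

// Returns true and fills `out` when the dictionary alone settles the word.
// Returns false when the word is unknown and suffix rules would be needed.
bool lookup(const Dictionary& dict, const std::string& word, std::string& out) {
    auto it = dict.find(word);
    if (it == dict.end()) return false;
    // Assumption: exceptions map to their stored root, while ordinary
    // dictionary words are already acceptable stems.
    out = it->second.exception ? it->second.root : word;
    return true;
}
```

Because the key already holds the word, the entry only carries the extra data, which keeps the per-word overhead closer to the plain container of strings suggested earlier in the thread.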
On Sun, Feb 15, 2015 at 05:05:11PM +0000, James Aylett wrote:

> How you then structure that in your code as you load it from file and
> later use it is entirely up to you. If it's just a list of words that
> should be treated specially, having a class to represent each word
> feels like overkill: you can probably do it with something like an
> STL container of a basic_string of some sort (std::wstring? I haven't
> done much Unicode work in C++, so others may want to jump in and
> correct me here).

Where xapian-core cares about the encoding, it deals with UTF-8 encoded text, which we store as const char * or std::string.

Using std::wstring would be appropriate if we were handling wide characters, but converting UTF-8 to and from a wide character string is likely to end up significantly slower. The trade-off is that iterating over a UTF-8 string is more complex than iterating over a wide character string, where it's a simple pointer dereference and increment per Unicode character.

If you use std::string, that is a class and it represents each word, which as James says might indeed be overkill. If the stemming dictionary is potentially very large, you might want to load the file into a single allocated block of memory and then just use const char * into that block for the words. That would avoid the overhead of creating a huge number of std::string objects.

Cheers,
Olly
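One way the single-block idea above might look, as a sketch only. The class name PackedWordList is hypothetical, and the code assumes the whole file has already been read into one string with one UTF-8 word per line and Unix line endings. Newlines are overwritten with NUL bytes so that each word becomes a C string inside the block, and lookups use a sorted array of pointers rather than one std::string per word.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper, not part of the Xapian API.
class PackedWordList {
  public:
    // `text` is the entire dictionary file, one word per line (assumed).
    explicit PackedWordList(const std::string& text)
        : block_(text.begin(), text.end()) {
        block_.push_back('\n');  // make sure the last word is terminated
        char* start = block_.data();
        for (char& c : block_) {
            if (c != '\n') continue;
            c = '\0';                                    // end the current word
            if (*start != '\0') words_.push_back(start); // skip blank lines
            start = &c + 1;                              // next word begins here
        }
        // Sort the pointers by byte value so lookups can binary-search.
        std::sort(words_.begin(), words_.end(), less);
    }

    // The pointers refer into block_, so copying would leave them dangling.
    PackedWordList(const PackedWordList&) = delete;
    PackedWordList& operator=(const PackedWordList&) = delete;

    bool contains(const char* word) const {
        return std::binary_search(words_.begin(), words_.end(), word, less);
    }

    std::size_t size() const { return words_.size(); }

  private:
    static bool less(const char* a, const char* b) {
        return std::strcmp(a, b) < 0;
    }

    std::vector<char> block_;         // the single allocated block
    std::vector<const char*> words_;  // one pointer per word, into block_
};
```

The cost is one allocation for the text plus one for the pointer array, however many words the dictionary holds. Comparing with strcmp orders words by raw byte value, which is all that membership tests on UTF-8 text need.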