thr3ads.net - Xapian devel - [Xapian-devel] Bitsize project: Krovetz Stemmer [Feb 2015]


Richhiey Thomas

2015-Feb-15 07:04 UTC

[Xapian-devel] Bitsize project: Krovetz Stemmer

Hello Xapian devs,

I had shown interest in writing a Krovetz stemmer for Xapian and spoke to
James Aylett about it. Since it was hard to code the stemmer in Snowball, I
came up with a C++ implementation of the stemmer.
But since it is a dictionary-based stemmer, I'm having trouble deciding
how to create the dictionary.
I checked out some of the implementations of the Krovetz stemmer online,
but all of them have large dictionaries, and I'm not sure whether that would
be helpful in our case, since the dictionary would be more useful if the
user could configure it.
I believe words such as exceptions and nationalities have to be treated
differently, and I have implemented that by creating a DictEntry class with a
boolean member named exception.
Any advice on how to proceed would be a great help :)

James Aylett

2015-Feb-15 17:05 UTC


On 15 Feb 2015, at 07:04, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:
> Since [Krovetz] is a dictionary based stemmer, im having problems on
> deciding how to create the dictionary.

Richhiey, I think I recommended that you load any dictionaries you need from a
file, which could be specified when constructing the stemmer. That separates the
need to create the dictionary from implementing the feature, although we'll have
to provide some initial dictionary eventually.

How you then structure that in your code as you load it from file and later use
it is entirely up to you. If it's just a list of words that should be treated
specially, having a class to represent each word feels like overkill; you can
probably do it with something like an STL container of a base_string of some
sort (std::wstring? I haven't done much Unicode in C++ work, so others may want
to jump in and correct me here).
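
A minimal sketch of that suggestion, assuming a plain-text file with one UTF-8
word per line. The class name WordList and its members are hypothetical and are
not part of Xapian; the point is only that the file path is supplied when the
object is constructed, and that a standard container of strings is enough for a
plain word list.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_set>

// Hypothetical class, not part of Xapian: holds the words a stemmer
// should treat specially. Assumed file format: one UTF-8 word per line.
class WordList {
    std::unordered_set<std::string> words_;

    void load(std::istream& in) {
        std::string line;
        while (std::getline(in, line)) {
            // Tolerate Windows line endings.
            if (!line.empty() && line.back() == '\r') line.pop_back();
            if (!line.empty()) words_.insert(line);
        }
    }

public:
    // Load from a file whose path is chosen by the caller.
    explicit WordList(const std::string& path) {
        std::ifstream in(path);
        if (!in) throw std::runtime_error("cannot open dictionary: " + path);
        load(in);
    }

    // Load from any stream, which is convenient for testing.
    explicit WordList(std::istream& in) { load(in); }

    bool contains(const std::string& word) const {
        return words_.count(word) != 0;
    }

    std::size_t size() const { return words_.size(); }
};
```

A stemmer could take such an object (or just the path) in its constructor, so
the dictionary contents stay independent of the stemming code.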

J

-- 
 James Aylett, occasional trouble-maker
 xapian.org

Richhiey Thomas

2015-Feb-15 19:07 UTC


Hello,
Yes James, I will load the required dictionary according to the requirements
of the program. I have also structured the program accordingly, so it
shouldn't be a problem :)
Also, I did not mean a different class for every exception. I'm using an
unordered_map to map each word to its dictionary entry, for which I have
created a DictEntry class. It stores the word along with a flag indicating
whether it is an exception or not. That should work well, right?
Also, once I'm done, how can I have one of you review the code and help me
proceed?
Thanks.
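
The layout described above might look roughly like the sketch below. Only the
names DictEntry, exception and unordered_map come from the post; the root
field, the file format (a word alone on a line is a regular entry, a word
followed by a root is an exception), and the helper functions are assumptions
made for illustration, not the actual code.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Entry type modelled on the description in the post; the `root`
// field is an assumption added for this sketch.
struct DictEntry {
    std::string root;   // stem to return for this word
    bool exception;     // true if the usual suffix rules must not apply
};

using Dictionary = std::unordered_map<std::string, DictEntry>;

// Assumed format: "word" for a regular entry, "word root" for an exception.
Dictionary load_dictionary(std::istream& in) {
    Dictionary dict;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string word, root;
        if (!(fields >> word)) continue;   // skip blank lines
        bool is_exception = static_cast<bool>(fields >> root);
        if (!is_exception) root = word;    // regular words stem to themselves
        dict[word] = DictEntry{root, is_exception};
    }
    return dict;
}

// Returns the entry for `word`, or nullptr if it is not in the dictionary.
const DictEntry* lookup(const Dictionary& dict, const std::string& word) {
    auto it = dict.find(word);
    return it == dict.end() ? nullptr : &it->second;
}
```

With this shape, the stemmer would consult lookup() first and fall back to its
suffix-stripping rules only when the word is absent or not flagged as an
exception.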

Olly Betts

2015-Feb-15 22:34 UTC


On Sun, Feb 15, 2015 at 05:05:11PM +0000, James Aylett wrote:
> How you then structure that in your code as you load it from file and
> later use it is entirely up to you. If it's just a list of words that
> should be treated specially, having a class to represent each word
> feels like overkill; you can probably do it with something like an
> STL container of a base_string of some sort (std::wstring? I haven't
> done much Unicode in C++ work, so others may want to jump in and
> correct me here).
Where xapian-core cares about the encoding, it deals with UTF-8 encoded
text, which we store as const char * or std::string.  Using std::wstring
would be appropriate if we were handling wide characters, but converting
UTF-8 to and from a wide character string is likely to end up
significantly slower.  The trade-off is that iterating over a UTF-8 string
is more complex than iterating over a wide character string, where it's a
simple pointer dereference and increment per Unicode character.
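
To make the trade-off concrete, here is a simplified sketch of both kinds of
iteration. It is not xapian-core's code: the function names are made up for
illustration, the decoder assumes well-formed UTF-8 and does no validation,
and std::u32string stands in for a wide character string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode one code point starting at s[i] and advance i past it.
// Simplified: assumes s holds valid UTF-8 and performs no error checking.
char32_t next_code_point(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;                  // 1 byte: ASCII
    int extra;                                     // continuation bytes to read
    char32_t cp;
    if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
    else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
    else                   { extra = 1; cp = lead & 0x1F; }
    while (extra-- > 0 && i < s.size()) {
        // Each continuation byte contributes its low 6 bits.
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Walk a UTF-8 string: a variable number of bytes per character.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ) {
        out.push_back(next_code_point(s, i));
    }
    return out;
}

// Walk a wide string: one dereference and one increment per character.
std::size_t count_wide(const std::u32string& s) {
    std::size_t n = 0;
    for (const char32_t* p = s.data(); p != s.data() + s.size(); ++p) ++n;
    return n;
}
```

The UTF-8 walk needs a branch on the lead byte and a small loop per character,
whereas the wide walk does not, but the wide form requires converting the
text first.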

If you use std::string, that is a class and it represents each word,
which as James says might indeed be overkill.  If the stemming
dictionary is potentially very large, you might want to load the file
into a single allocated block of memory and then just use const char *
into that block for the words - that would avoid the overhead of
creating a huge number of std::string objects.
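
A rough sketch of that single-block idea, under the assumption of a file with
one word per line. PackedWordList and its members are hypothetical names, and
a sorted vector with binary search is just one possible way to look words up;
a hash table keyed on the pointers would work too.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical class: the whole file lives in one buffer, and each word
// is referenced by a const char* into that buffer.
class PackedWordList {
    std::string block_;                // file contents, newlines turned into '\0'
    std::vector<const char*> words_;   // sorted pointers into block_

    static bool before(const char* a, const char* b) {
        return std::strcmp(a, b) < 0;
    }

public:
    explicit PackedWordList(std::istream& in)
        : block_(std::istreambuf_iterator<char>(in),
                 std::istreambuf_iterator<char>()) {
        // Turn every line ending into a terminator so each word becomes
        // a NUL-terminated C string inside the block.
        for (char& c : block_) {
            if (c == '\n' || c == '\r') c = '\0';
        }
        const char* p = block_.c_str();
        const char* end = p + block_.size();
        while (p < end) {
            if (*p != '\0') words_.push_back(p);   // skip empty lines
            p += std::strlen(p) + 1;
        }
        std::sort(words_.begin(), words_.end(), before);
    }

    // The pointers refer to this object's own buffer, so forbid copying.
    PackedWordList(const PackedWordList&) = delete;
    PackedWordList& operator=(const PackedWordList&) = delete;

    bool contains(const char* word) const {
        assert(word != nullptr);
        return std::binary_search(words_.begin(), words_.end(), word, before);
    }

    std::size_t size() const { return words_.size(); }
};
```

This performs one large allocation for the text plus one for the pointer
array, instead of one std::string per word.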

Cheers,
    Olly
