On Fri, 2009-08-21 at 17:22 +0200, dimazest at gmail.com wrote:
> I use python xapian bindings to stem strings and get this behavior:
>
> Python 2.4.6 (#1, Jul 24 2009, 19:28:46)
> [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import xapian
> >>> xapian.version_string()
> '1.0.14'
> >>> s = xapian.Stem('en')
> >>> s('editing')
> 'edit'
> >>> s('Editing')
> 'Edite'
>
> Is it a bug or a feature, that for the word 'Editing' different result
> is returned than for edit?
Hi Dima,
I think the stemmer is ignoring uppercase token prefixes, so in the
second case it's actually stemming the word "diting". This is likely
related to Xapian's term prefixes, which are all uppercase:
http://xapian.org/docs/omega/termprefixes.html
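If case is the culprit, one workaround is to lowercase each word before
handing it to the stemmer. A minimal sketch, assuming plain lowercasing is
acceptable for your text; stem_lowercased is a hypothetical helper name, not
part of the xapian bindings, and the import is guarded so the snippet still
loads where xapian is not installed:

```python
def stem_lowercased(stem, word):
    """Lowercase `word`, then pass it to the `stem` callable.

    `stem` can be any callable taking a string, e.g. xapian.Stem('en').
    This helper is illustrative only; it is not a xapian API.
    """
    return stem(word.lower())


try:
    import xapian  # optional: only used for the demo below
except ImportError:
    xapian = None

if xapian is not None:
    s = xapian.Stem('en')
    # 'Editing' is lowercased to 'editing' before it reaches the stemmer,
    # so it should stem the same way as s('editing') in the session above.
    print(stem_lowercased(s, 'Editing'))
```

With this, 'Editing' and 'editing' reach the stemmer as the same string, so
they cannot produce different stems.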
The stemming algorithm treats English words starting with
consonant-vowel-consonant differently, to handle words like duping ->
dupe, doting -> dote etc.
Actually, it's more complicated than that:
http://snowball.tartarus.org/algorithms/english/stemmer.html
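As a rough illustration of why letter case can matter to that algorithm,
here is a sketch of the R1 region it defines (the part of the word after the
first non-vowel that follows a vowel). This is my simplified reading of that
page, not Xapian's code: it assumes the vowels are the lowercase letters
a, e, i, o, u, y, and it skips the special handling of y and the exceptional
prefixes described there. The name r1 is my own.

```python
VOWELS = set("aeiouy")  # assumption: lowercase only, as listed on that page


def r1(word):
    """Return the region after the first non-vowel following a vowel.

    Returns the empty string if there is no such non-vowel.
    """
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return word[i + 1:]
    return ""


print(r1("editing"))  # iting
print(r1("Editing"))  # ing
```

Under these assumptions a capital "E" is not counted as a vowel, so the two
spellings get different regions and the later suffix rules can act on them
differently.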
John.
--
http://johnleach.co.uk