Inspecting a real-life index gives me a lot of R strings of
Rbla--------------------------------------------
and a lot of ------------------- terms.

Should not "----" and "====" or "****" better be stemmed to, let's say, 3
chars?
"---"

For all languages of course.
-- 
Reini
On Mon, Aug 07, 2006 at 05:06:54PM +0200, Reini Urban wrote:
> Inspecting a real-life index gives me a lot of R strings of
> Rbla--------------------------------------------
> and a lot of ------------------- terms.
>
> Should not "----" and "====" or "****" better be stemmed to let's say 3
> chars?
> "---"
>
> For all languages of course.

R-terms are non-stemmed; they are the raw terms, in case you need them
for direct searching. If you are getting a lot of strange characters in
your raw terms, you might be better off trimming them out before
throwing the whole lot at scriptindex (since that is in practice easier
than doing this with omindex).

James
-- 
James Aylett                                      xapian.org
james@tartarus.org                                uncertaintydivision.org
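The trimming step suggested above could look roughly like the following. This is only a sketch: the function name `strip_symbol_runs`, the set of characters ("-", "=", "*") and the threshold of three are assumptions made for illustration, not anything scriptindex or omindex provides.

```python
import re

# Assumption: a "strange" run is three or more of the same character
# out of '-', '=' and '*'. Shorter runs (e.g. "Cl-", "a--b") are kept.
_SYMBOL_RUN = re.compile(r"([-=*])\1{2,}")


def strip_symbol_runs(text: str) -> str:
    """Replace each long run of a repeated symbol with a single space."""
    return _SYMBOL_RUN.sub(" ", text)
```

The idea would be to run each field's text through such a filter before writing the input file that is handed to scriptindex, so rulers made of dashes never reach the indexer at all.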
On Mon, Aug 07, 2006 at 05:06:54PM +0200, Reini Urban wrote:
> Inspecting a real-life index gives me a lot of R strings of
> Rbla--------------------------------------------

This is intended to allow indexing of things like Cl- (a chloride ion),
but there's currently no sanity check on the number of minuses. As you
indicate, there really ought to be.

I'm actually somewhat doubtful that it's all that useful. People
complain if they can't search for C++ and C#, but keeping any trailing
"-" as part of the term risks gluing hyphens onto words if there's no
space between a word and a following hyphen.

> and a lot of ------------------- terms.

You shouldn't get terms consisting only of punctuation. If you do,
that's definitely a bug.

> Should not "----" and "====" or "****" better be stemmed to let's say 3
> chars?
> "---"

"=" and "*" are never included in terms, so they aren't an issue.

Cheers,
    Olly
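To make the missing sanity check concrete, here is a small sketch of one way it might work. It is not Omega's code: the function name `term_with_suffix`, the suffix characters "+", "#" and "-", and the limit of three are all assumptions chosen to match the examples in this thread (C++, C#, Cl-).

```python
def term_with_suffix(word: str, trailing: str, max_suffix: int = 3) -> str:
    """Return word plus its symbol suffix, unless the suffix is too long.

    `trailing` is the text that immediately follows `word`. The leading
    run of '+', '#' and '-' characters is treated as a candidate suffix.
    """
    n = 0
    while n < len(trailing) and trailing[n] in "+#-":
        n += 1
    # No suffix at all, or an implausibly long one (e.g. a ruler made of
    # dashes): index the bare word only.
    if n == 0 or n > max_suffix:
        return word
    return word + trailing[:n]
```

With these assumptions, `term_with_suffix("c", "++ rocks")` gives `"c++"` and `term_with_suffix("cl", "- ion")` gives `"cl-"`, while a word followed by forty dashes comes back unchanged. It does nothing about the hyphen-gluing risk mentioned above; a single "-" directly after a word is still kept.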
On Mon, Aug 07, 2006 at 05:06:54PM +0200, Reini Urban wrote:
> Inspecting a real-life index gives me a lot of R strings of
> Rbla--------------------------------------------

OK, try this simple patch:

http://www.oligarchy.co.uk/xapian/patches/omega-no-long-symbol-suffixes.patch

Cheers,
    Olly