Per Jessen
2010-May-11 13:18 UTC
[Xapian-discuss] indexing words with alternative spellings
Some languages (e.g. German and Danish) have special letters that are often written using two-letter combinations when the appropriate keyboard or medium is not available:

ä = ae
ü = ue
ö = oe
æ = ae
ø = oe
å = aa
ß = ss

(there are undoubtedly far more examples than those)

As a user of an index, I would like to be able to search for e.g. "schaefer" and get matches on both 'ae' and 'ä' returned. Same if I searched on 'schäfer'. Is this something I would need to take into account when I do the indexing or?

/Per Jessen, Zürich
Oliver Flimm
2010-May-11 13:46 UTC
[Xapian-discuss] indexing words with alternative spellings
Hi,

On Tue, May 11, 2010 at 03:18:38PM +0200, Per Jessen wrote:
> Some languages (e.g. German and Danish) have special letters that are
> often written using two-letter combinations when the appropriate
> keyboard or medium is not available:
> ä = ae
[...]
> As a user of an index, I would like to be able to search for
> e.g. "schaefer" and get matches on both 'ae' and 'ä' returned. Same if
> I searched on 'schäfer'. Is this something I would need to take into
> account when I do the indexing or?

You have to take it into account both when indexing and when searching. I'm using Xapian in a library catalogue and convert these "special" characters to the two-letter combinations, both when generating terms or postings and when processing user input.

Regards,
O. Flimm

--
Universitaet zu Koeln :: Universitaets- und Stadtbibliothek
IT-Dienste :: Abteilung Universitaetsgesamtkatalog
Universitaetsstr. 33 :: D-50931 Koeln
Tel.: +49 221 470-3330 :: Fax: +49 221 470-5166
flimm at ub.uni-koeln.de :: www.ub.uni-koeln.de
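A minimal sketch of the "normalise on both sides" idea, assuming a hand-written mapping table and a toy in-memory index instead of Xapian itself. The names `DIGRAPHS`, `normalise`, `build_index` and `search` are made up for illustration; the point is only that the same function runs over indexed text and over user input.

```python
import unicodedata

# Hypothetical mapping table (not part of Xapian); extend as needed.
DIGRAPHS = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",  # German
    "æ": "ae", "ø": "oe", "å": "aa",             # Danish
}
_TABLE = str.maketrans(DIGRAPHS)


def normalise(text):
    """Compose, lower-case, and replace special letters by two-letter forms."""
    # NFC first, so that "a" + combining diaeresis is treated like "ä".
    return unicodedata.normalize("NFC", text).lower().translate(_TABLE)


def build_index(docs):
    """Map each normalised term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(normalise(word), set()).add(doc_id)
    return index


def search(index, query):
    """Normalise the query exactly like the indexed terms, then look it up."""
    return index.get(normalise(query), set())


docs = {1: "Schäfer", 2: "Schaefer", 3: "Jessen"}
index = build_index(docs)
```

With this in place, `search(index, "schaefer")` and `search(index, "Schäfer")` both look up the same term, so either spelling finds documents 1 and 2. In a real setup the `normalise` step would be applied to text before it reaches the indexer and to the query string before it is parsed.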
Michel Pelletier
2010-May-11 16:00 UTC
[Xapian-discuss] indexing words with alternative spellings
Different languages have different libraries for dealing with this issue. We use one for Python called 'translitcodec', which can do both long (ä -> ae) and short (ä -> a) conversion. It's very likely there is a similar library for whatever language you are using.

http://pypi.python.org/pypi/translitcodec/0.1

-Mike
Olly Betts
2010-May-13 02:06 UTC
[Xapian-discuss] indexing words with alternative spellings
On Tue, May 11, 2010 at 03:18:38PM +0200, Per Jessen wrote:
> Some languages (e.g. German and Danish) have special letters that are
> often written using two-letter combinations when the appropriate
> keyboard or medium is not available:

For German, you can use the "german2" stemmer, which transliterates as you describe.

There's also unac for more general accent normalisation:

http://www.nongnu.org/unac/

There's actually a version 1.8.0 not mentioned there (but Debian has it). Not sure what's up, but the upstream page at http://www.senga.org/unac/ is no longer there.

Cheers,
Olly
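For a feel of what general accent normalisation does, here is a rough sketch using only Python's standard `unicodedata` module. This is not unac's API, just an approximation of the idea: decompose each character and drop the combining marks. The function name `strip_accents` is made up for illustration.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after Unicode compatibility decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Note that this gives the short form: `strip_accents("schäfer")` yields "schafer", not "schaefer", so on its own it does not make the 'ae' spelling match. Letters without a Unicode decomposition, such as 'ß', 'æ' and 'ø', pass through this sketch unchanged and would still need a mapping table.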