Oat ABCTech
2012-Nov-26 05:26 UTC
[Xapian-devel] Word missing after stemmed with Norwegian in Search::Xapian::TermGenerator
Hi all Xapian-devel,

Gist: https://gist.github.com/10d2222d8bffe8d7631d

I'm using Search::Xapian::TermGenerator to turn Norwegian sentences into a vsm (vector space model). But when I generate the vsm from

'Truet med ? stevne misforn?yd PC-kunde - PC-leverand?ren Asus likte sv?rt d?rlig kundens misforn?yde leserbrev.'

the result doesn't contain 'asus'.

So I tried replacing 'Asus' with other brand names: Acer, Apple, Dell, Fujitsu, HP, Lenovo, LG, NEC, Samsung, Sony and Toshiba. Most of them do show up in the result, except Acer, Apple and Dell. However, for the names that do show up, the term 'd?r' is missing from the result instead.

This problem may be caused by encoding, which I'm investigating now. It would be great if you could help, and if you have any questions about this problem, please reply to me.

Best regards,
Theerapat
Olly Betts
2012-Nov-26 19:21 UTC
[Xapian-devel] Word missing after stemmed with Norwegian in Search::Xapian::TermGenerator
On Mon, Nov 26, 2012 at 12:26:40PM +0700, Oat ABCTech wrote:
> I'm using Xapian-TermGenerator to extract Norwegian sentences to vsm
> (vector space model) using TermGenerator. But when I test generating vsm
> from 'Truet med ? stevne misforn?yd PC-kunde - PC-leverand?ren Asus likte
> sv?rt d?rlig kundens misforn?yde leserbrev.' It doesn't return 'asus' result
> in vsm.

Have you tried looking at the terms which are in the database? If not, try:

    delve /path/to/database -t Zasus

If 'Zasus' is in the database, then the problem is probably in whatever Novus is doing.

If it isn't in the database, then a simpler testcase would be very helpful (especially one which doesn't pull in other modules beyond Search::Xapian).

Cheers,
    Olly
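
On the encoding suspicion raised in the first message: Xapian's term generator works on UTF-8 text, so one quick thing to rule out is whether the bytes being indexed are actually valid UTF-8. The following is only a rough sketch in Python using the standard library, not the code from the gist. The helper name `is_valid_utf8` and the sample word are assumptions made for illustration (the non-ASCII characters of the original sentence are lost in the archive, so a stand-in Norwegian word is used).

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample word containing a non-ASCII Norwegian letter.
sample = "dårlig"

# The same word encoded two ways: UTF-8 and ISO-8859-1 (Latin-1).
utf8_bytes = sample.encode("utf-8")
latin1_bytes = sample.encode("latin-1")

print("UTF-8 bytes valid as UTF-8:  ", is_valid_utf8(utf8_bytes))
print("Latin-1 bytes valid as UTF-8:", is_valid_utf8(latin1_bytes))
```

If the input fails a check like this, decoding it from its real encoding and re-encoding it as UTF-8 before indexing would be the next thing to try. If it passes, encoding is probably not the cause and the delve check above is the better lead.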