hightman
2011-Sep-14 05:40 UTC
[Xapian-discuss] Integrated Chinese tokenizer SCWS in xapian-core
Xapian is an excellent open source search engine library, but it has no native support for Chinese word segmentation in queryparser and termgenerator. I therefore modified a small amount of the source code to integrate the SCWS tokenizer, which is also open source and was developed by me.

Anyone can obtain the patch from the URL below. After patching, Xapian::QueryParser::parse_query and Xapian::TermGenerator::index_text will support Chinese word segmentation directly.

https://github.com/hightman/xunsearch/blob/master/xapian-scws/patch.xapian-core-scws

I hope it is useful to Chinese users of Xapian.

----------

The following is about xunsearch, which is built on xapian-core and scws. It includes two back-end servers written in C/C++ and a front-end development library written in PHP. It provides an easier-to-use search engine solution for Chinese users.

Download: http://www.xunsearch.com/download/
Documentation: http://www.xunsearch.com/doc/
Git repository: http://github.com/hightman/xunsearch/
Olly Betts
2011-Sep-15 05:45 UTC
[Xapian-discuss] Integrated Chinese tokenizer SCWS in xapian-core
On Wed, Sep 14, 2011 at 01:40:25PM +0800, hightman wrote:
> Xapian is a very excellent open source search engine library, but
> there is no native support for Chinese word segmentation in
> queryparser and termgenerator.

Actually, trunk now has code for an n-gram based approach, and there is a GSoC project which has been working on adding support for segmentation using dictionaries and other heuristics, but there is certainly room for supporting multiple alternative approaches.

> Therefore, I modified small amount of source codes, integrated into
> the SCWS tokenizer, that is the same open-source and developed by
> myself.

What licence is SCWS released under? I couldn't find this information anywhere. The nearest I came was the COPYING file in the distribution. I tried converting this from BIG-5 to UTF-8, which gave plausible-looking Chinese text, but Google Translate just gave gibberish when I tried to convert the UTF-8 text to English to get the gist.

> Anyone can obtain the patch from below URL. After patching,
> Xapian::QueryParser::parse_query and Xapian::TermGenerator::index_text
> will support Chinese words segmentation directly.
>
> https://github.com/hightman/xunsearch/blob/master/xapian-scws/patch.xapian-core-scws

Thanks for the patch. If you want to get this integrated into Xapian releases, we really need a patch against trunk (this one won't apply cleanly, since it hooks into the same places as the new n-gram CJK code).

We also really need test coverage for the added code, so we know that it actually works and to help ensure it isn't broken by future changes.

Also, please confirm that you're happy to license the patch suitably. See "Licensing of patches" in HACKING:

http://trac.xapian.org/browser/trunk/xapian-core/HACKING#L1203

Cheers,
Olly
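
For readers unfamiliar with the n-gram approach mentioned above, here is a minimal C++ sketch of the general idea: each run of CJK characters is split into overlapping two-character terms (bigrams), so no dictionary is needed. This is not the code in Xapian trunk and not the SCWS patch. The names cjk_bigrams, is_cjk and next_code_point are hypothetical, the input is assumed to be well-formed UTF-8, and treating only U+4E00 to U+9FFF as CJK is a simplifying assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Assumption: the input is well-formed UTF-8 (no validation, for brevity).
static std::uint32_t next_code_point(const std::string& s, std::size_t& i) {
    unsigned char c = static_cast<unsigned char>(s[i++]);
    int extra = 0;
    std::uint32_t cp = c;
    if (c >= 0xF0) { extra = 3; cp = c & 0x07; }
    else if (c >= 0xE0) { extra = 2; cp = c & 0x0F; }
    else if (c >= 0xC0) { extra = 1; cp = c & 0x1F; }
    while (extra-- > 0 && i < s.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Hypothetical helper: only the CJK Unified Ideographs block counts as CJK.
static bool is_cjk(std::uint32_t cp) {
    return cp >= 0x4E00 && cp <= 0x9FFF;
}

// Hypothetical helper: return overlapping bigrams for every run of CJK
// characters in text. A run of length one is emitted as a single-character
// term; non-CJK characters only act as run separators and are dropped.
std::vector<std::string> cjk_bigrams(const std::string& text) {
    std::vector<std::string> terms;
    std::vector<std::string> run;  // UTF-8 bytes of each character in the current run

    auto flush = [&]() {
        if (run.size() == 1) terms.push_back(run[0]);
        for (std::size_t k = 0; k + 1 < run.size(); ++k) {
            terms.push_back(run[k] + run[k + 1]);
        }
        run.clear();
    };

    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t start = i;
        std::uint32_t cp = next_code_point(text, i);
        if (is_cjk(cp)) {
            run.push_back(text.substr(start, i - start));
        } else {
            flush();
        }
    }
    flush();
    return terms;
}
```

Under these assumptions, calling cjk_bigrams on the four-character string 中文分词 yields the three terms 中文, 文分 and 分词. A dictionary-based segmenter such as SCWS would instead try to emit actual words, which is the trade-off between the two approaches discussed in this thread.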