emmanuel at engelhart.org
2010-Jan-28 10:50 UTC
[Xapian-discuss] Problem getting Xapian working with Burmese
On Fri, Aug 21, 2009 at 02:44:44PM +0200, emmanuel at engelhart.org wrote:
>> I want to update my request.
>> Is my question badly formulated? Too trivial? ... or maybe pretty
>> complicated/unclear?
>
> I think nobody answered as it was hard to follow your example because
> the Burmese characters seem to have been mangled (at least the message I
> received wasn't valid utf-8).
>
> But looking at the code, I see an issue:
>
>> my $db = Search::Xapian::Database->new( './xapdb' );
>> my $enq = $db->enquire( $ARGV[0] );
>
> What this does is to create an Enquire object and set Query($ARGV[0]) as
> the query. That works OK if $ARGV[0] is a single word which gets
> indexed as a single term, but you really want to parse the query string
> to get a Query object:
>
>   my $db = Search::Xapian::Database->new( './xapdb' );
>   my $queryparser = Search::Xapian::QueryParser->new();
>   my $query = $queryparser->parse_query( $ARGV[0] );
>   my $enq = $db->enquire( $query );
>
> I'd guess that is probably your problem, but I can't tell for sure as I
> can't test your examples...
>
> For further information on debugging this sort of problem, see:
>
> http://trac.xapian.org/wiki/FAQ/NoMatches

Hi Olly,

Thanks for your answer (and sorry for not having replied sooner).

Your answer helped me, and I think I now understand why "it does not work".

For test purposes I indexed one document containing a single string, using index_text_without_positions() (C++ API). The string is "??????????????????????????".
See this log: http://tmp.kiwix.org/tmp/kiwix-index.log (utf8 encoded)

But if I run "delve -r 1 /path/to/db" on the index, I get the following answer:
Term List for record #1: test ? ? ? ? ? ? (utf8 encoded)
See the log: http://tmp.kiwix.org/tmp/delve.log

So it seems clear to me why "it does not work": my word is split into single letters and a lot of letters are removed.

Am I right? Can we avoid that and index "??????????????????????????" as a single word?

Regards
Emmanuel
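One way to investigate which characters go missing from the term list is to list every code point of the indexed string together with its Unicode general category. The following is only a minimal sketch using Python's standard library, not part of Xapian; the helper name `describe` and the sample string are hypothetical, since the original Burmese text did not survive in this archive.

```python
import unicodedata


def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        (
            "U+{:04X}".format(ord(ch)),
            unicodedata.category(ch),
            unicodedata.name(ch, "<unnamed>"),
        )
        for ch in text
    ]


# Hypothetical sample (NOT the string from the mail): a Myanmar letter,
# a Myanmar vowel sign, and a zero-width space.
SAMPLE = "\u1015\u1032\u200b"

if __name__ == "__main__":
    for row in describe(SAMPLE):
        print(*row)
```

Characters whose category is not a letter or number category (for example `Mn` or `Cf`) are the first candidates to check when a word ends up split or shortened in the index.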
Emmanuel Engelhart
2010-Jan-29 17:09 UTC
[Xapian-discuss] Problem getting Xapian working with Burmese
emmanuel at engelhart.org wrote:
> On Fri, Aug 21, 2009 at 02:44:44PM +0200, emmanuel at engelhart.org wrote:
>>> I want to update my request.
>>> Is my question badly formulated? Too trivial? ... or maybe pretty
>>> complicated/unclear?
>> I think nobody answered as it was hard to follow your example because
>> the Burmese characters seem to have been mangled (at least the message I
>> received wasn't valid utf-8).

I think the root cause is not an encoding issue, because the text is displayed correctly for me. Burmese characters are often not available by default on a system, so you have to install a font yourself. If you cannot see the Burmese characters here with your browser:
http://my.wikipedia.org
... that means you do not have a Burmese-compatible font installed on your OS. You can try installing this font set:
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=Padauk

Regards
Emmanuel
Emmanuel Engelhart
2010-Jan-31 10:31 UTC
[Xapian-discuss] Problem getting Xapian working with Burmese
emmanuel at engelhart.org wrote:
> On Fri, Aug 21, 2009 at 02:44:44PM +0200, emmanuel at engelhart.org wrote:
>>> I want to update my request.
>>> Is my question badly formulated? Too trivial? ... or maybe pretty
>>> complicated/unclear?
>> I think nobody answered as it was hard to follow your example because
>> the Burmese characters seem to have been mangled (at least the message I
>> received wasn't valid utf-8).
>>
>> But looking at the code, I see an issue:
>>
>>> my $db = Search::Xapian::Database->new( './xapdb' );
>>> my $enq = $db->enquire( $ARGV[0] );
>> What this does is to create an Enquire object and set Query($ARGV[0]) as
>> the query. That works OK if $ARGV[0] is a single word which gets
>> indexed as a single term, but you really want to parse the query string
>> to get a Query object:
>>
>>   my $db = Search::Xapian::Database->new( './xapdb' );
>>   my $queryparser = Search::Xapian::QueryParser->new();
>>   my $query = $queryparser->parse_query( $ARGV[0] );
>>   my $enq = $db->enquire( $query );
>>
>> I'd guess that is probably your problem, but I can't tell for sure as I
>> can't test your examples...
>>
>> For further information on debugging this sort of problem, see:
>>
>> http://trac.xapian.org/wiki/FAQ/NoMatches
>
> Hi Olly,
>
> Thanks for your answer (and sorry for not having replied sooner).
>
> Your answer helped me, and I think I now understand why "it does not work".
>
> For test purposes I indexed one document containing a single string, using index_text_without_positions() (C++ API). The string is "??????????????????????????".
> See this log: http://tmp.kiwix.org/tmp/kiwix-index.log (utf8 encoded)
>
> But if I run "delve -r 1 /path/to/db" on the index, I get the following answer:
> Term List for record #1: test ? ? ? ? ? ? (utf8 encoded)
> See the log: http://tmp.kiwix.org/tmp/delve.log
>
> So it seems clear to me why "it does not work": my word is split into single letters and a lot of letters are removed.
>
> Am I right? Can we avoid that and index "??????????????????????????" as a single word?

I think I have more or less understood what is wrong.

"?????" is the name of "Paris" in Burmese. Here is the result of delve -r 1:
Term List for record #1: ? ??

We can see that the diacritics were removed, and I think this is the issue: the tokenizer interprets the diacritics as separators. That should not be the case, because they do not stand alone but belong to a letter.

Maybe something is wrong in Utf8Iterator or in is_wordchar()?

Regards
Emmanuel
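The effect described above can be reproduced outside Xapian with a toy tokenizer. The sketch below is an illustration only: its `is_wordchar` is a hypothetical stand-in based on Unicode general categories, not Xapian's implementation, and `WORD` is an assumed five-code-point Myanmar string (letter, vowel sign, letter, letter, asat) chosen because the original characters were lost in this archive.

```python
import unicodedata

# Assumed stand-in for the lost example word; not taken from the original mail.
WORD = "\u1015\u1032\u101b\u1005\u103a"


def is_wordchar(ch, marks_are_wordchars):
    """Toy word-character test (hypothetical, not Xapian's is_wordchar)."""
    major = unicodedata.category(ch)[0]
    if major in ("L", "N"):  # letters and numbers always count
        return True
    # Combining marks (Mn, Mc, Me) count only when the flag is set.
    return marks_are_wordchars and major == "M"


def tokenize(text, marks_are_wordchars):
    """Split text into maximal runs of word characters."""
    terms, current = [], []
    for ch in text:
        if is_wordchar(ch, marks_are_wordchars):
            current.append(ch)
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


if __name__ == "__main__":
    # Marks act as separators: the word falls apart and the marks vanish.
    print(tokenize(WORD, marks_are_wordchars=False))
    # Marks are part of the word: one term comes out.
    print(tokenize(WORD, marks_are_wordchars=True))
```

With marks excluded, the sample yields a one-letter term followed by a two-letter term, the same shape as the "? ??" term list above; with marks included, it stays a single term.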
emmanuel at engelhart.org
2010-Feb-02 09:44 UTC
[Xapian-discuss] Problem getting Xapian working with Burmese
On Sun 31/01/10 23:53, "Olly Betts" olly at survex.com wrote:
> There seem to be two issues here.
>
> The first is with NON_SPACING_MARK characters (which I think is what
> you are referring to above). In 1.1.x, these are treated as part of
> the word, but this issue was reported when we were at about 1.0.11, so we
> couldn't just change the behaviour of 1.0.x without breaking existing
> databases. So we went for the less good but compatible approach of
> making QueryParser treat these characters as phrase generators.
>
> This is the ticket for that issue which has more detail:
>
> http://trac.xapian.org/ticket/355

Indeed, this seems to be the issue. I ran a test against the development source code and it works better (fewer cuts in the words).

> The second issue in your case is that there are zero-width space
> characters in there as well, which currently act as word breaks. These are present
> to indicate acceptable places to split a word when wrapping text, so we
> should ideally just strip them out when generating terms.

OK, so that may explain why there are still cuts in the words (even with the development code).

Should I open a bug for that? Is there a plan to fix it?

Regards
Emmanuel
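Until term generation handles zero-width spaces itself, one possible application-side workaround is to remove U+200B from the text before it is handed to the indexer. This is an assumption about how one might work around the problem, not something proposed in the thread. The sketch below uses a toy tokenizer rather than Xapian, and the names `strip_zwsp`, `tokenize` and `SAMPLE` are hypothetical.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE: a line-break hint, not part of the word


def tokenize(text):
    """Toy tokenizer: letters, numbers and marks form terms; the rest splits."""
    terms, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "N", "M"):
            current.append(ch)
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


def strip_zwsp(text):
    """Remove zero-width spaces so they can no longer split a term."""
    return text.replace(ZWSP, "")


# Hypothetical sample: Myanmar code points with a ZWSP in the middle.
SAMPLE = "\u1015\u1032" + ZWSP + "\u101b\u1005\u103a"

if __name__ == "__main__":
    print(len(tokenize(SAMPLE)))              # ZWSP splits the word
    print(len(tokenize(strip_zwsp(SAMPLE))))  # one term after stripping
```

The same pre-processing would have to be applied to query strings as well, otherwise the indexed terms and the query terms would no longer match.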