athlon athlonf
2007-Oct-18 20:47 UTC
[Xapian-discuss] UTF-8 becomes glibberish in searches
I'm using dbi2omega and scriptindex to index a database with Chinese characters.
Searches are done with the PHP 4 bindings.

While the index file is in UTF-8, the results from the searches are gibberish.

These characters (changed to HTML encoding for this message)
?? become something like this: ???????

What am I doing wrong here? Is it the indexing, or is it the searching?

How can I check whether the database is indeed in UTF-8?

I'm using a fresh install of Ubuntu and therefore a fresh version 1.0.2 of Xapian.

This is part of the code I use to get the results:

    // Start an enquire session.
    $enquire = new XapianEnquire($database);

    $query_string = $_POST['terms'];

    $qp = new XapianQueryParser();
    $stemmer = new XapianStem("english");
    $qp->set_stemmer($stemmer);
    $qp->set_database($database);
    $qp->set_stemming_strategy(XapianQueryParser_STEM_SOME);
    $qp->add_valuerangeprocessor( new XapianDateValueRangeProcessor(1) );
    $qp->set_default_op( OP_AND );
    $query = $qp->parse_query($query_string);
    print "Parsed query is: " . $query->get_description(). "<br/>";

    // Find the top 10 results for the query.
    $enquire->set_query($query);
    $enquire->set_sort_by_relevance_then_value(1,1);
    $matches = $enquire->get_mset(0, 10);

    // Display the results.
    print $matches->get_matches_estimated()." results found:\n";
    echo "<pre>";
    $i = $matches->begin();
    while (!$i->equals($matches->end())) {
        $n = $i->get_rank() + 1;
        $document = $i->get_document();
        $data = $document->get_data();
        foreach (split("\n", $data) as $line) {
            $nameval = split("=", $line, 2);
            $field[$nameval[0]] = $nameval[1];
        }
        print_r($field);
        echo "$n: ". $i->get_percent()." % id=:". $i->get_docid();
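For readers on a current PHP: split() was removed in PHP 7, so the field-parsing loop above will not run there. The following is only a sketch of the same "name=value" parsing using explode(). The function name parse_document_data is made up for illustration, and it assumes the document data holds one field per line, as the posted code does.

```php
<?php
// Hypothetical helper: turn "name=value" lines (the layout the posted
// code expects from get_data()) into an associative array.
function parse_document_data(string $data): array
{
    $fields = [];
    foreach (explode("\n", $data) as $line) {
        // Skip blank or malformed lines that contain no "=" separator.
        if (strpos($line, '=') === false) {
            continue;
        }
        // Limit of 2 keeps any further "=" characters inside the value.
        [$name, $value] = explode('=', $line, 2);
        $fields[$name] = $value;
    }
    return $fields;
}

// Sample data invented for this example.
print_r(parse_document_data("url=/docs/1\ntitle=a=b\n"));
```

Returning a fresh array per document also avoids carrying fields over from the previous match, which a shared $field array could do.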
On Thu, Oct 18, 2007 at 12:47:21PM -0700, athlon athlonf wrote:
> I'm using dbi2omega and scriptindex to index a database with chinese
> characters.
> Searches are done with php4-bindings.
>
> While the index-file is in utf8, the results from the searches are
> glibberish.
>
> These characters (changed to htmlencoding for this message)
> ?????? becomes something like this: ??????????

I just see "?" and inverse "?" here in mutt I'm afraid...

> What am I doing wrong here? Is it the indexing, or is it the searching?

You need to step through the process, checking that everything is OK
after each step. It could be dbi2omega is wrong, or scriptindex, or
xapian itself, or the PHP bindings.

First of all, I'd run dbi2omega redirected to a file, and then see if
the UTF-8 is correct in that file.

> How can I check if the database is indeed in utf-8?

Use the "delve" utility (in xapian-core, examples/delve) to look at the
terms for a few documents.

If both dbi2omega and the database look OK, then it's probably the PHP
bindings. If you're writing the results as a web page, have you set the
character set of the webpage to UTF-8 correctly? Check what your web
browser says its character set is.

Cheers,
    Olly
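One way to do the first check (is the dbi2omega output really UTF-8?) is to scan the redirected file for lines that are not well-formed UTF-8. This is a rough sketch rather than anything from the Xapian tools: the function name find_invalid_utf8_lines and the file name dump.txt are assumptions for illustration. It relies on preg_match() with the /u modifier failing when the subject is not valid UTF-8.

```php
<?php
// Return the 1-based numbers of lines that are not well-formed UTF-8.
function find_invalid_utf8_lines(string $text): array
{
    $bad = [];
    foreach (explode("\n", $text) as $index => $line) {
        // With /u, preg_match() returns false for invalid UTF-8 input
        // and 1 when the empty pattern matches a valid string.
        if (preg_match('//u', $line) !== 1) {
            $bad[] = $index + 1;
        }
    }
    return $bad;
}

// Hypothetical file: the result of redirecting dbi2omega to dump.txt.
$path = 'dump.txt';
if (is_readable($path)) {
    $bad = find_invalid_utf8_lines(file_get_contents($path));
    if ($bad) {
        echo 'Invalid UTF-8 on lines: ' . implode(', ', $bad) . "\n";
    } else {
        echo "Dump looks like valid UTF-8\n";
    }
}
```

A clean result only shows the bytes are well-formed UTF-8. Text that was converted from the wrong source encoding can still pass, so it is worth viewing a few lines in a UTF-8 terminal as well.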
Try putting this at the top of your PHP code, before any output is sent:

    header("Content-type: text/html; charset=utf-8");

"athlon athlonf" <athlonkmf@yahoo.com> wrote in message
news:549946.81147.qm@web31011.mail.mud.yahoo.com...
> While the index-file is in utf8, the results from the searches are
> glibberish.
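As a slightly fuller sketch of that suggestion: the header has to be sent before the first echo or print, and any text written into the page should be escaped as UTF-8. The helper name html_utf8 and the sample string are invented for this example and are not part of the Xapian bindings.

```php
<?php
// Declare the page encoding first; once output has started, the
// header can no longer be changed.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Hypothetical helper: escape a value for HTML, treating it as UTF-8.
function html_utf8(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// Arbitrary Chinese sample text standing in for a stored field value.
$title = "\u{540C}\u{4E8B} <test>";
echo '<p>' . html_utf8($title) . "</p>\n";
```

If the browser still reports a different character set, the web server may be sending its own default charset, which is worth checking in the response headers.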
athlon athlonf
2007-Oct-21 19:37 UTC
[Xapian-discuss] Re: UTF-8 becomes glibberish in searches
This was indeed the solution. Thanks :)

----- Original Message ----
From: Andrey <alpha04@netvigator.com>
To: xapian-discuss@lists.xapian.org
Sent: Thursday, October 18, 2007 11:31:58 PM
Subject: [Xapian-discuss] Re: UTF-8 becomes glibberish in searches

Try on top of your php code

    header("Content-type: text/html; charset=utf-8");

_______________________________________________
Xapian-discuss mailing list
Xapian-discuss@lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss