athlon athlonf
2007-Oct-18 20:47 UTC
[Xapian-discuss] UTF-8 becomes glibberish in searches
I'm using dbi2omega and scriptindex to index a database containing Chinese
characters.
Searches are done with the PHP4 bindings.
Although the index file is in UTF-8, the search results come back as
gibberish.
These characters (changed to HTML encoding for this message)
?? becomes something like this: ???????
What am I doing wrong here? Is it the indexing, or is it the searching?
How can I check whether the database is indeed in UTF-8?
I'm using a fresh install of Ubuntu and therefore a fresh version 1.0.2
of Xapian.
This is part of the code I use to get the results:
// Start an enquire session.
$enquire = new XapianEnquire($database);
$query_string = $_POST['terms'];
$qp = new XapianQueryParser();
$stemmer = new XapianStem("english");
$qp->set_stemmer($stemmer);
$qp->set_database($database);
$qp->set_stemming_strategy(XapianQueryParser_STEM_SOME);
$qp->add_valuerangeprocessor(new XapianDateValueRangeProcessor(1));
$qp->set_default_op(OP_AND);
$query = $qp->parse_query($query_string);
print "Parsed query is: " . $query->get_description() . "<br/>";
// Find the top 10 results for the query.
$enquire->set_query($query);
$enquire->set_sort_by_relevance_then_value(1, 1);
$matches = $enquire->get_mset(0, 10);
// Display the results.
print $matches->get_matches_estimated() . " results found:\n";
echo "<pre>";
$i = $matches->begin();
while (!$i->equals($matches->end())) {
    $n = $i->get_rank() + 1;
    $document = $i->get_document();
    $data = $document->get_data();
    foreach (split("\n", $data) as $line) {
        $nameval = split("=", $line, 2);
        $field[$nameval[0]] = $nameval[1];
    }
    print_r($field);
    echo "$n: " . $i->get_percent() . " % id=:" . $i->get_docid();
On Thu, Oct 18, 2007 at 12:47:21PM -0700, athlon athlonf wrote:
> I'm using dbi2omega and scriptindex to index a database with chinese
> characters.
> Searches are done with php4-bindings.
>
> While the index-file is in utf8, the results from the searches are
> glibberish.
>
> These characters (changed to htmlencoding for this message)
> ?????? becomes something like this: ??????????

I just see "?" and inverse "?" here in mutt I'm afraid...

> What am I doing wrong here? Is it the indexing, or is it the searching?

You need to step through the process, checking that everything is OK
after each step. It could be dbi2omega is wrong, or scriptindex, or
xapian itself, or the PHP bindings.

First of all, I'd run dbi2omega redirected to a file, and then see if
the UTF-8 is correct in that file.

> How can I check if the database is indeed in utf-8?

Use the "delve" utility (in xapian-core, examples/delve) to look at the
terms for a few documents.

If both dbi2omega and the database look OK, then it's probably the PHP
bindings.

If you're writing the results as a web page, have you set the character
set of the webpage to UTF-8 correctly? Check what your web browser says
its character set is.

Cheers,
Olly
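The first check described above (redirect the dbi2omega output to a file and confirm that it really is UTF-8) can be sketched in a few lines. This is only an illustrative sketch, written in Python rather than the PHP used in the thread. The function name `first_invalid_utf8_offset` and the sample bytes are made up for the example and are not part of Xapian or dbi2omega.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first offending byte.
        return err.start
    return None


# Two Chinese characters (U+540C U+4E8B) encoded as UTF-8: well-formed.
good = "\u540c\u4e8b".encode("utf-8")
# ASCII followed by high bytes that do not form a UTF-8 sequence, as a
# legacy multi-byte encoding might produce (sample bytes are an assumption).
bad = b"abc\xa6\x50\xa8\xc6"

print(first_invalid_utf8_offset(good))
print(first_invalid_utf8_offset(bad))
```

For a real dump, something like `first_invalid_utf8_offset(open("dump.txt", "rb").read())` would do, where `dump.txt` is a hypothetical file holding the redirected dbi2omega output. A result of `None` only shows that the bytes are well-formed UTF-8; it does not rule out text that was encoded twice.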
Try putting this at the top of your PHP code, before any output is sent:
header("Content-type: text/html; charset=utf-8");
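Why a charset declaration matters: a browser that is not told the page is UTF-8 may fall back to a single-byte default such as ISO-8859-1 and show each byte of a multi-byte character as a separate symbol. The sketch below illustrates that effect. It is in Python for illustration only; the helper name `render_as` is invented here and the sample characters are arbitrary.

```python
def render_as(data: bytes, charset: str) -> str:
    """Decode bytes the way a client assuming `charset` would."""
    return data.decode(charset, errors="replace")


# Two Chinese characters, which take six bytes in UTF-8.
utf8_bytes = "\u540c\u4e8b".encode("utf-8")

as_utf8 = render_as(utf8_bytes, "utf-8")         # read with the right charset
as_latin1 = render_as(utf8_bytes, "iso-8859-1")  # read with the wrong one

# The wrong charset yields one character per byte instead of two in total.
print(len(as_utf8), len(as_latin1))
```

The bytes are identical in both cases; only the charset the reader assumes changes what is displayed. That is why the data can be stored correctly and still look wrong in the browser.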
athlon athlonf
2007-Oct-21 19:37 UTC
[Xapian-discuss] Re: UTF-8 becomes glibberish in searches
This was indeed the solution. Thanks :)
----- Original Message ----
From: Andrey <alpha04@netvigator.com>
To: xapian-discuss@lists.xapian.org
Sent: Thursday, October 18, 2007 11:31:58 PM
Subject: [Xapian-discuss] Re: UTF-8 becomes glibberish in searches
Try
on top of your php code
header("Content-type: text/html; charset=utf-8");
_______________________________________________
Xapian-discuss mailing list
Xapian-discuss@lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss