Hello,

I searched the mailing list, but all the language problems seemed to
disappear with UTF-8 support in Xapian 1.0. Still, I can't figure this out.
I'm on Windows XP with C# and .NET Framework 3.5.

TermGenerator.IndexText treats some characters as separators, for example
German '?' or Danish '?', so it splits words containing them into separate
parts, while other letters are 'simplified': ? / ? -> o, ? -> a. Indexing
text in Russian results in an unreadable term list (retrieved later by
iterating from Document.TermListBegin to End).

I wrote my own indexer that doesn't split those words but adds them with the
AddPosting method. Still, when they are read back from the database there
are '?' (question marks) in place of '?' / '?' and o in place of ? / ?.

The other part of the problem is the QueryParser, which does the same bad
things to query terms. I searched the Xapian source code and found that it
requires a UTF-8 encoded string as input. Am I right?

    Query QueryParser::Internal::parse_query(const string &qs, unsigned flags,
                                             string &default_prefix) {
        ...
        Utf8Iterator it(qs), end;

So I made a small patch to the bindings' Query.cs / QueryParser.cs files so
that I can override the ParseQuery method and pass a UTF-8 encoded string to
the Xapian DLL. With a debugger I break at the DLL entry point
_CSharp_QueryParser_ParseQuery__SWIG_1 and have verified that the parameter
passed is a UTF-8 encoded string. It still doesn't help! The query term
iterator returns strange things instead of German/Danish/Russian characters,
and the search fails. Has anyone managed to get search working in
non-English languages?
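
To make the symptom concrete, here is a minimal sketch of the byte-level difference between a UTF-8 encoded term and the same term pushed through a lossy single-byte conversion. The class and method names (EncodingProbe, AsUtf8, AsLossy, Hex) are made up for illustration, and Encoding.ASCII is only a stand-in for a lossy conversion: the real Windows ANSI code page may behave differently (for example by "best-fit" mapping a letter to a similar ASCII one instead of '?').

```csharp
using System;
using System.Text;

public static class EncodingProbe
{
    // Hex dump of a byte array, e.g. "C3 A6".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    // The byte sequence a UTF-8 consumer such as Utf8Iterator expects.
    public static byte[] AsUtf8(string s)
    {
        return Encoding.UTF8.GetBytes(s);
    }

    // Assumption: ASCII stands in for any single-byte charset that cannot
    // represent the character; unrepresentable characters become '?' (0x3F).
    public static byte[] AsLossy(string s)
    {
        return Encoding.ASCII.GetBytes(s);
    }

    public static void Main()
    {
        string term = "Trov\u00E6rdig"; // sample Danish word containing U+00E6

        // prints "54 72 6F 76 C3 A6 72 64 69 67"
        Console.WriteLine(Hex(AsUtf8(term)));

        // prints "54 72 6F 76 3F 72 64 69 67"
        Console.WriteLine(Hex(AsLossy(term)));
    }
}
```

If the bytes arriving at the DLL look like the second line rather than the first, the string was damaged before Xapian ever saw it. The same kind of dump on strings coming back from Xapian would show whether the return path is affected as well.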

On Tue, 1 Sep 2009 18:58:47 +0300, win wrote:

> Did someone manage the search to work in non-english languages ?

It works for me (Danish):

 * http://kammeratadam.dk/find/?q=Trov%C3%A6rdig

... but I'm using the Perl-bindings under Linux, so this datapoint might not
be that valuable in your context.

(Debian 5.0.2 (lenny), Xapian 1.0.7, Search::Xapian 1.0.7, Perl 5.10.0)

Best regards,

  Adam

-- 
"I think there are enough frivolous lawsuits in this country without people
fighting over pop songs."
Adam Sjøgren
asjo at koldfront.dk

On Tue, Sep 01, 2009 at 06:58:47PM +0300, win 32 wrote:

> Hello,I searched the mailing list but all the language problems seemed to
> disapper with UTF-8 support in Xapian 1.0. Still I can't figure it out, I'm
> under Windows XP, CSharp and .NET framework 3.5

Unicode support for the C# bindings has been tested and is known to work
with Mono, but whether it works with Microsoft's implementation is an open
question:

http://xapian.org/docs/bindings/csharp/

It sounds like it doesn't work out of the box currently. But it's probably
just a matter of working out how the conversions on string parameters and
string return values need to be and it will all suddenly work.

> I wrote my own indexer that doesn't split those words but saves them with
> AddPosting method, still when they are read from the database there are '?'
> (question signs) in places of '?' / '?' and o in place of ? / ?.

It's unclear from this if the problem is passing strings to Xapian, or
returning strings from Xapian (or both). Can you look at the database with
the "delve" utility to see what terms actually get added?

Cheers,
    Olly

All right, as I understand it, the default Windows .NET marshalling doesn't
work because it translates C# Unicode strings to the user's default ANSI
code page (Russian in my case), and Xapian wants UTF-8, not ANSI. So I wrote
a custom marshaller to perform the Unicode -> UTF-8 -> Unicode conversions:

    using System;
    using System.Runtime.InteropServices;
    using System.Text;

    class UTF8StringMarshaler : ICustomMarshaler
    {
        private static UTF8StringMarshaler marshaler = null;

        // The runtime calls this static factory to obtain the marshaler.
        public static ICustomMarshaler GetInstance(string cookie)
        {
            if (marshaler == null)
                marshaler = new UTF8StringMarshaler();
            return marshaler;
        }

        // Managed string -> NUL-terminated UTF-8 buffer in unmanaged memory.
        public IntPtr MarshalManagedToNative(Object ManagedObj)
        {
            if (!(ManagedObj is string))
                throw new ArgumentException("The passed object is not a string",
                                            "ManagedObj");
            byte[] unicodeBytes = Encoding.Unicode.GetBytes((string)ManagedObj);
            byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8,
                                                unicodeBytes);
            IntPtr result = Marshal.AllocHGlobal(utf8Bytes.Length + 1);
            Marshal.Copy(utf8Bytes, 0, result, utf8Bytes.Length);
            Marshal.WriteByte(result, utf8Bytes.Length, 0); // terminating NUL
            return result;
        }

        // Frees the buffer allocated in MarshalManagedToNative.
        public void CleanUpNativeData(IntPtr pNativeData)
        {
            Marshal.FreeHGlobal(pNativeData);
        }

        public int GetNativeDataSize()
        {
            return 0;
        }

        // NUL-terminated UTF-8 buffer -> managed string.
        public Object MarshalNativeToManaged(IntPtr pNativeData)
        {
            // Get the unmanaged string length by scanning for the NUL.
            int length = 0;
            while (Marshal.ReadByte(pNativeData, length) != 0)
                length++;
            byte[] utf8Bytes = new byte[length];
            Marshal.Copy(pNativeData, utf8Bytes, 0, length);
            return Encoding.UTF8.GetString(utf8Bytes);
        }

        // Nothing to release on the managed side.
        public void CleanUpManagedData(Object ManagedObj)
        {
        }
    } // class UTF8StringMarshaler
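
A quick way to gain confidence in the two conversions before wiring them into P/Invoke declarations is to round-trip a few strings through unmanaged memory. The sketch below mirrors the marshaler's logic in two standalone helpers; the names Utf8RoundTrip, ToNative, FromNative and RoundTrip are hypothetical and not part of the Xapian bindings, and the sample strings are arbitrary.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class Utf8RoundTrip
{
    // Managed string -> NUL-terminated UTF-8 buffer (caller must free it).
    public static IntPtr ToNative(string s)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(s);
        IntPtr p = Marshal.AllocHGlobal(utf8.Length + 1);
        Marshal.Copy(utf8, 0, p, utf8.Length);
        Marshal.WriteByte(p, utf8.Length, 0);
        return p;
    }

    // NUL-terminated UTF-8 buffer -> managed string.
    public static string FromNative(IntPtr p)
    {
        int length = 0;
        while (Marshal.ReadByte(p, length) != 0)
            length++;
        byte[] utf8 = new byte[length];
        Marshal.Copy(p, utf8, 0, length);
        return Encoding.UTF8.GetString(utf8);
    }

    // Converts to native and back, always releasing the buffer.
    public static string RoundTrip(string s)
    {
        IntPtr p = ToNative(s);
        try
        {
            return FromNative(p);
        }
        finally
        {
            Marshal.FreeHGlobal(p);
        }
    }

    public static void Main()
    {
        // Danish, Russian and plain ASCII samples (illustrative only).
        string[] samples = { "Trov\u00E6rdig", "\u043F\u043E\u0438\u0441\u043A", "plain" };
        foreach (string s in samples)
            Console.WriteLine(RoundTrip(s) == s ? "ok" : "MISMATCH");
    }
}
```

This only exercises the conversion itself. It says nothing about which allocator owns the strings that the Xapian DLL returns, so the cleanup path for return values still needs to be checked separately.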

Then I just broke the encapsulation of the classes I used in order to get
the pointers to the C++ objects, and called the corresponding C++ DLL entry
points, for example:

    // The class name is illustrative; extension methods must be declared
    // in a static class.
    public static class XapianUtf8Extensions
    {
        [DllImport("_XapianSharp",
                   EntryPoint = "CSharp_Query_GetDescription")]
        [return: MarshalAs(UnmanagedType.CustomMarshaler,
                           MarshalTypeRef = typeof(UTF8StringMarshaler))]
        public static extern string Query_GetDescription(HandleRef jarg1);

        public static string GetDescriptionU(this Xapian.Query query)
        {
            string ret = Query_GetDescription(query.swigCPtr);
            if (Xapian.XapianPINVOKE.SWIGPendingException.Pending)
                throw Xapian.XapianPINVOKE.SWIGPendingException.Retrieve();
            return ret;
        }

        [DllImport("_XapianSharp",
                   EntryPoint = "CSharp_QueryParser_ParseQuery__SWIG_0")]
        static extern IntPtr QueryParser_ParseQuery__SWIG_0(
            HandleRef jarg1,
            [MarshalAs(UnmanagedType.CustomMarshaler,
                       MarshalTypeRef = typeof(UTF8StringMarshaler))] string jarg2,
            uint jarg3,
            [MarshalAs(UnmanagedType.CustomMarshaler,
                       MarshalTypeRef = typeof(UTF8StringMarshaler))] string jarg4);

        public static Xapian.Query ParseQueryU(this Xapian.QueryParser queryParser,
            string query_string, uint flags, string default_prefix)
        {
            Xapian.Query ret;
            ret = new Xapian.Query(
                QueryParser_ParseQuery__SWIG_0(queryParser.swigCPtr,
                    query_string, flags, default_prefix), true);
            if (Xapian.XapianPINVOKE.SWIGPendingException.Pending)
                throw Xapian.XapianPINVOKE.SWIGPendingException.Retrieve();
            return ret;
        }
    }

Maybe somebody familiar with SWIG can create an interface file for Windows
.NET that uses my marshaler code and the declaration attributes shown above
to generate a proper XapianPINVOKE.cs? I get a bit lost in the heap of
files, although I will keep searching.