Hello,

I searched the mailing list, but all the language problems seemed to disappear with UTF-8 support in Xapian 1.0. Still, I can't figure this out. I'm on Windows XP, using C# and .NET Framework 3.5.

TermGenerator.IndexText treats some characters as separators, for example German '?' or Danish '?', so it splits words containing them into separate parts, while other letters are 'simplified': ? / ? -> o, ? -> a. Indexing text in Russian results in an unreadable term list (retrieved later by iterating through Document.TermListBegin .. End).

I wrote my own indexer that doesn't split those words but saves them with the AddPosting method. Still, when they are read back from the database there are '?' (question marks) in place of '?' / '?' and o in place of ? / ?.

The other part of the problem is with the QueryParser, which does the same bad things to query terms. I searched the Xapian source code and found that it requires a UTF-8 encoded string as input, am I right?

    Query
    QueryParser::Internal::parse_query(const string &qs, unsigned flags,
                                       string &default_prefix)
    {
        ...
        Utf8Iterator it(qs), end;

So I made a small patch to the bindings' Query.cs / QueryParser.cs files that lets me override the ParseQuery method and pass a UTF-8 encoded string to the Xapian DLL. With a debugger I break at the DLL entry point _CSharp_QueryParser_ParseQuery__SWIG_1 and make sure that the passed parameter really is a UTF-8 encoded string. Still it doesn't help! The query term iterator returns strange things instead of German/Danish/Russian characters, and the search fails.

Did someone manage to get the search to work in non-English languages?

On Tue, 1 Sep 2009 18:58:47 +0300, win wrote:

> Did someone manage to get the search to work in non-English languages?

It works for me (Danish):

 * http://kammeratadam.dk/find/?q=Trov%C3%A6rdig

... but I'm using the Perl bindings under Linux, so this datapoint might not be that valuable in your context.

(Debian 5.0.2 (lenny), Xapian 1.0.7, Search::Xapian 1.0.7, Perl 5.10.0)

  Best regards,

    Adam

-- 
 "I think there are enough frivolous lawsuits in this        Adam Sjøgren
  country without people fighting over pop songs."      asjo at koldfront.dk

On Tue, Sep 01, 2009 at 06:58:47PM +0300, win 32 wrote:

> Hello, I searched the mailing list, but all the language problems seemed to
> disappear with UTF-8 support in Xapian 1.0. Still, I can't figure this out.
> I'm on Windows XP, using C# and .NET Framework 3.5.

Unicode support for the C# bindings has been tested and is known to work with Mono, but whether it works with Microsoft's implementation is an open question:

http://xapian.org/docs/bindings/csharp/

It sounds like it doesn't work out of the box currently. But it's probably just a matter of working out what the conversions on string parameters and string return values need to be, and then it will all suddenly work.

> I wrote my own indexer that doesn't split those words but saves them with
> the AddPosting method. Still, when they are read back from the database
> there are '?' (question marks) in place of '?' / '?' and o in place of ? / ?.

It's unclear from this whether the problem is in passing strings to Xapian, in returning strings from Xapian, or both. Can you look at the database with the "delve" utility to see what terms actually get added?

Cheers,
    Olly

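The two directions mentioned above (strings going into Xapian and strings coming back out) can be looked at in isolation, without Xapian at all. The following is only a minimal sketch under stated assumptions: code page 1251 stands in for the user's ANSI code page, the sample word is taken from the URL earlier in the thread, and the class name is made up for illustration. On newer .NET runtimes, code page 1251 may need the code-pages encoding provider registered before `Encoding.GetEncoding(1251)` succeeds.

```csharp
using System;
using System.Text;

// Hypothetical demo class; not part of the Xapian bindings.
class AnsiVersusUtf8
{
    static void Main()
    {
        // "Troværdig", the Danish word from the URL earlier in the thread.
        string word = "Trov\u00E6rdig";

        // Assumption: 1251 (Cyrillic) plays the role of the default ANSI code page.
        Encoding ansi = Encoding.GetEncoding(1251);

        // Direction 1: managed string -> bytes handed to native code.
        byte[] asAnsi = ansi.GetBytes(word);          // what ANSI marshalling would send
        byte[] asUtf8 = Encoding.UTF8.GetBytes(word); // what a UTF-8 consumer expects
        Console.WriteLine("ANSI bytes:  " + BitConverter.ToString(asAnsi));
        Console.WriteLine("UTF-8 bytes: " + BitConverter.ToString(asUtf8));

        // Direction 2: UTF-8 bytes returned by native code -> managed string.
        string decodedAsAnsi = ansi.GetString(asUtf8);          // mis-decoded
        string decodedAsUtf8 = Encoding.UTF8.GetString(asUtf8); // correct decoding
        Console.WriteLine("ANSI decode matches original:  " + (decodedAsAnsi == word));
        Console.WriteLine("UTF-8 decode matches original: " + (decodedAsUtf8 == word));
    }
}
```

In the UTF-8 byte dump the letter æ shows up as C3-A6, the same two bytes that appear percent-encoded as %C3%A6 in the URL above. Comparing the two byte dumps with what delve reports should help tell which direction is going wrong.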
All right, as I understand it, the default Windows .NET marshalling doesn't work because it translates C# Unicode strings to the user's default ANSI charset (Russian in my case), and it is UTF-8, not ANSI, that Xapian wants. So I wrote a custom marshaller to perform the Unicode -> UTF-8 -> Unicode conversions:

    using System;
    using System.Runtime.InteropServices;
    using System.Text;

    class UTF8StringMarshaler : ICustomMarshaler
    {
        private static UTF8StringMarshaler marshaler = null;

        // Called by the runtime to obtain the marshaler instance.
        public static ICustomMarshaler GetInstance(string cookie)
        {
            if (marshaler == null)
                marshaler = new UTF8StringMarshaler();
            return marshaler;
        }

        // Managed string -> NUL-terminated UTF-8 buffer in unmanaged memory.
        public IntPtr MarshalManagedToNative(Object ManagedObj)
        {
            if (!(ManagedObj is string))
                throw new ArgumentException("The passed object is not a string", "ManagedObj");

            byte[] utf8Bytes = Encoding.UTF8.GetBytes((string)ManagedObj);
            IntPtr result = Marshal.AllocHGlobal(utf8Bytes.Length + 1);
            Marshal.Copy(utf8Bytes, 0, result, utf8Bytes.Length);
            Marshal.WriteByte(result, utf8Bytes.Length, 0);
            return result;
        }

        public void CleanUpNativeData(IntPtr pNativeData)
        {
            Marshal.FreeHGlobal(pNativeData);
        }

        public int GetNativeDataSize()
        {
            return 0;
        }

        // NUL-terminated UTF-8 buffer -> managed string.
        public Object MarshalNativeToManaged(IntPtr pNativeData)
        {
            int length;
            for (length = 0; Marshal.ReadByte(pNativeData, length) != 0; length++)
                ; // get unmanaged string length

            byte[] utf8Bytes = new byte[length];
            Marshal.Copy(pNativeData, utf8Bytes, 0, length);
            return Encoding.UTF8.GetString(utf8Bytes);
        }

        public void CleanUpManagedData(Object ManagedObj)
        {
        }
    } // class UTF8StringMarshaler

Then I just broke the encapsulation of the classes I used, to get at the pointers to the C++ objects, and called the corresponding C++ DLL entry points like this:

    // Extension methods must live in a static class; the name here is arbitrary.
    static class XapianUtf8Extensions
    {
        [DllImport("_XapianSharp", EntryPoint = "CSharp_Query_GetDescription")]
        [return: MarshalAs(UnmanagedType.CustomMarshaler,
                           MarshalTypeRef = typeof(UTF8StringMarshaler))]
        public static extern string Query_GetDescription(HandleRef jarg1);

        public static string GetDescriptionU(this Xapian.Query query)
        {
            string ret = Query_GetDescription(query.swigCPtr);
            if (Xapian.XapianPINVOKE.SWIGPendingException.Pending)
                throw Xapian.XapianPINVOKE.SWIGPendingException.Retrieve();
            return ret;
        }

        [DllImport("_XapianSharp", EntryPoint = "CSharp_QueryParser_ParseQuery__SWIG_0")]
        static extern IntPtr QueryParser_ParseQuery__SWIG_0(
            HandleRef jarg1,
            [MarshalAs(UnmanagedType.CustomMarshaler,
                       MarshalTypeRef = typeof(UTF8StringMarshaler))] string jarg2,
            uint jarg3,
            [MarshalAs(UnmanagedType.CustomMarshaler,
                       MarshalTypeRef = typeof(UTF8StringMarshaler))] string jarg4);

        public static Xapian.Query ParseQueryU(this Xapian.QueryParser queryParser,
                                               string query_string, uint flags,
                                               string default_prefix)
        {
            Xapian.Query ret;
            ret = new Xapian.Query(
                QueryParser_ParseQuery__SWIG_0(queryParser.swigCPtr, query_string,
                                               flags, default_prefix),
                true);
            if (Xapian.XapianPINVOKE.SWIGPendingException.Pending)
                throw Xapian.XapianPINVOKE.SWIGPendingException.Retrieve();
            return ret;
        }
    }

Maybe somebody familiar with SWIG can create an interface file for Windows .NET, using my marshaler code and the declaration attributes above, to generate a proper XapianPINVOKE.cs? I got a bit lost in the heap of files, although I will keep searching.
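
The conversion steps in the marshaler can be checked on their own, without Xapian or P/Invoke, by round-tripping a few strings through an unmanaged buffer. This is just a sketch: the class and helper names (`Utf8RoundTripCheck`, `ToNativeUtf8`, `FromNativeUtf8`) and the sample words are invented for illustration, and it only exercises the byte conversions, not how the runtime or the SWIG glue calls the marshaler.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical stand-alone check; mirrors the marshaler's two conversions.
class Utf8RoundTripCheck
{
    // Managed string -> NUL-terminated UTF-8 buffer in unmanaged memory.
    static IntPtr ToNativeUtf8(string s)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(s);
        IntPtr p = Marshal.AllocHGlobal(utf8.Length + 1);
        Marshal.Copy(utf8, 0, p, utf8.Length);
        Marshal.WriteByte(p, utf8.Length, 0); // terminating NUL
        return p;
    }

    // NUL-terminated UTF-8 buffer -> managed string.
    static string FromNativeUtf8(IntPtr p)
    {
        int length = 0;
        while (Marshal.ReadByte(p, length) != 0)
            length++;
        byte[] utf8 = new byte[length];
        Marshal.Copy(p, utf8, 0, length);
        return Encoding.UTF8.GetString(utf8);
    }

    static void Main()
    {
        // Illustrative samples: Danish "Troværdig" and Russian "поиск".
        string[] samples = { "Trov\u00E6rdig", "\u043F\u043E\u0438\u0441\u043A" };

        foreach (string s in samples)
        {
            IntPtr p = ToNativeUtf8(s);
            try
            {
                string back = FromNativeUtf8(p);
                Console.WriteLine(back == s ? "round trip ok" : "round trip FAILED");
            }
            finally
            {
                Marshal.FreeHGlobal(p); // same cleanup as CleanUpNativeData
            }
        }
    }
}
```

If both samples report a successful round trip, the byte-level conversion is sound, and any remaining garbling is more likely to come from how the strings cross the DLL boundary than from the encoding step itself.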