Attached is my rather largish omindex.cc patch with ChangeLog.

It needs autoreconf to update configure and the Makefiles.
Note that unrar is not patent infected, only rar, the compressor.
I've put some AC_PATH_PROG checks into configure for all helpers.

The patch is not yet complete.

2006-08-18 15:13:32 Reini Urban <reinhard.urban at avl.com>

    omega-0.9.6b:
    * omindex.cc: last_mod as value. Add HAVE_UNRAR, HAVE_MSGCONVERT,
      HAVE_READPST, HAVE_CATDOC checks. Add options --verbose, --silent
    * configure.ac: Add HAVE_CATDOC

2006-08-17 18:06:26 Reini Urban <reinhard.urban at avl.com>

    omega-0.9.6a:
    * omindex.cc: Added last_mod check, cache_dir, libtextcat,
      cached virtual directories (zip,msg,pst,...).
      New options: -c/--nocleanup, -i/--ignore-time.
      Add MS-Office mimetypes (word, excel, powerpoint, outlook)
    * configure.ac: Add HAVE_TEXTCAT, HAVE_UNRAR, HAVE_MSGCONVERT,
      HAVE_READPST, HAVE_CATDOC
    * commonhelp.cc: Update stemmer help with HAVE_TEXTCAT
      (lang autodetection)
    * configfile.cc: New cache_dir
    * Makefile.am: Prepare for omindex_test. Link omindex against
      configfile.
    * langclass, langclass.conf: New file and directory
    * omindex.1: updated by help2man

--
Reini Urban

-------------- next part --------------
A non-text attachment was scrubbed...
Name: omega-0.9.6b.patch.gz
Type: application/x-gzip
Size: 10832 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20060820/7f22ae94/attachment.bin>

On Sun, Aug 20, 2006 at 08:33:56PM +0200, Reini Urban wrote:
> The patch is not yet complete.

I've had a quick look through - I think there are some useful things in here. Let me know when you've finished tweaking it.

> It needs autoreconf to update configure and the Makefiles.
> Note that unrar is not patent infected, only rar, the compressor.
> I've put some AC_PATH_PROG checks into configure for all helpers.

This assumes that the filters installed at configure time are the same as those installed at run time, which isn't necessarily the case (for binary packaged versions, it's probably rarely true). I'd prefer to just run the filter anyway and check if it fails. I've just added some code to remove the ext->mime-type mapping when the filter fails because it couldn't be found, so we now effectively lazily probe the filters we want to use at run-time.

> +AM_LDFLAGS = -no-undefined

Sadly adding this unconditionally causes problems on some platforms (I forget which off the top of my head). Do you need it for cygwin?

> +#define SAMPLE_WORDS 500

This is actually the number of *CHARACTERS*, not words.

> +#ifdef HAVE_TEXTCAT
> +    char * lang;
> +    lang = textcat_Classify( textcat, sample.c_str(), sample.length()+1 );
> +    language = string(lang);
> +    if ((language != _TEXTCAT_RESULT_UNKOWN) // unknown language
> +        && (language != _TEXTCAT_RESULT_SHORT)) // too little information
> +    {
> +        if (language[0] == '[') {
> +            int pos = language.find(']',0);
> +            language = language.substr(1,pos-1);
> +        }
> +        record += "\nlanguage=" + language;
> +        if (language != curr_lang) {
> +            cout << "new language " << curr_lang << " => " << language << " ";
> +            stemmer = Xapian::Stem(language);
> +            curr_lang = language;
> +        }
> +    }
> +#endif

If each document is stemmed in a potentially different language, how do you decide which stemmer to use at query time?

Also, should documents which are categorised as "unknown" or "too short to determine" really just get the last used language?
I can see that's sometimes a good choice, but in other cases it can be very arbitrary. It also means that such documents can get an entirely different language in an update (because the previously processed document could be a completely different one if only a few documents have changed).

Cheers,
    Olly
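
The lazy run-time probing described above (run the filter anyway, and drop the ext->mime-type mapping only when the filter could not be found) might look roughly like the sketch below. This is not the code that went into omindex: `MimeMap`, `FilterResult`, `run_filter` and `index_with_filter` are hypothetical names chosen for illustration, and the sketch assumes a POSIX shell, which reports exit status 127 when a command cannot be found.

```cpp
#include <cstdlib>
#include <functional>
#include <map>
#include <string>
#include <sys/wait.h>

// Hypothetical result type: distinguishes "filter missing" from "filter ran but failed".
enum class FilterResult { Ok, Failed, NotFound };

// Hypothetical table mapping a file extension to its MIME type.
using MimeMap = std::map<std::string, std::string>;

// Run a filter command through the shell and classify the outcome.
// Assumption: a POSIX sh exits with status 127 when the command is not found.
FilterResult run_filter(const std::string& cmd) {
    int status = std::system(cmd.c_str());
    if (status == -1 || !WIFEXITED(status)) return FilterResult::Failed;
    int code = WEXITSTATUS(status);
    if (code == 0) return FilterResult::Ok;
    return code == 127 ? FilterResult::NotFound : FilterResult::Failed;
}

// Try to process a file with the given extension using the given filter command.
// If the filter is missing, remove the extension from the map so that later
// files with that extension are skipped without running the filter again.
// An ordinary failure leaves the mapping alone, since the next file may be fine.
bool index_with_filter(
    MimeMap& mime_map, const std::string& ext, const std::string& cmd,
    const std::function<FilterResult(const std::string&)>& runner = run_filter) {
    auto it = mime_map.find(ext);
    if (it == mime_map.end()) return false;  // unknown or already disabled
    FilterResult result = runner(cmd);
    if (result == FilterResult::NotFound) mime_map.erase(it);
    return result == FilterResult::Ok;
}
```

The `runner` parameter exists only so the behaviour can be exercised without the real helpers installed. The point of the design is that nothing is decided at configure time: a helper installed after the package was built is picked up automatically, and a missing one costs a single failed attempt per run.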
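
On the "unknown" and "too short" cases: one way to avoid inheriting whatever language the previous document happened to have is to fall back to a fixed default. The sketch below is only an illustration of that idea, not part of the patch. `pick_language` is a hypothetical helper, and it assumes the two libtextcat sentinel results are the plain strings "UNKNOWN" and "SHORT" and that a successful result looks like "[english][french]".

```cpp
#include <string>

// Hypothetical helper: turn a libtextcat-style result into a single language
// name, or return `fallback` when no language could be determined.
// Assumption: the sentinel results are the strings "UNKNOWN" and "SHORT".
std::string pick_language(const std::string& result, const std::string& fallback) {
    if (result.empty() || result == "UNKNOWN" || result == "SHORT")
        return fallback;
    if (result[0] != '[')
        return result;  // already a bare language name
    // Take the first bracketed candidate, e.g. "english" from "[english][french]".
    std::string::size_type close = result.find(']', 1);
    if (close == std::string::npos || close == 1)
        return fallback;  // malformed or empty "[...]" group
    return result.substr(1, close - 1);
}
```

Because the fallback depends only on configuration and not on which document was processed just before, re-indexing a small set of changed files would give each such document the same language every time. The query-time question of which stemmer to use is still open either way.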