thr3ads.net - Xapian devel - [Xapian-devel] omindex patch [Aug 2006]

Reini Urban

2006-Aug-20 18:33 UTC

[Xapian-devel] omindex patch

Attached is my rather largish omindex.cc patch with ChangeLog.

It needs autoreconf to update configure and the Makefiles.
Note that unrar is not patent infected, only rar, the compressor.
I've put some AC_PATH_PROG checks into configure for all helpers.

The patch is not yet complete.

2006-08-18 15:13:32 Reini Urban <reinhard.urban at avl.com>

	omega-0.9.6b:
	* omindex.cc: last_mod as value. Add HAVE_UNRAR,
	HAVE_MSGCONVERT, HAVE_READPST, HAVE_CATDOC checks.
	Add options --verbose, --silent
	* configure.ac: Add HAVE_CATDOC
	
2006-08-17 18:06:26 Reini Urban <reinhard.urban at avl.com>

	omega-0.9.6a:
	* omindex.cc: Added last_mod check, cache_dir, libtextcat,
	cached virtual directories (zip,msg,pst,...).
	New options: -c/--nocleanup, -i/--ignore-time.
	Add MS-Office mimetypes (word, excel, powerpoint, outlook)
	* configure.ac: Add HAVE_TEXTCAT, HAVE_UNRAR, HAVE_MSGCONVERT,
	HAVE_READPST, HAVE_CATDOC
	* commonhelp.cc: Update stemmer help with HAVE_TEXTCAT (lang
	autodetection)
	* configfile.cc: New cache_dir
	* Makefile.am: Prepare for omindex_test. Link omindex against
	configfile.
	* langclass, langclass.conf: New file and directory
	* omindex.1: updated by help2man
-- 
Reini Urban
-------------- next part --------------
A non-text attachment was scrubbed...
Name: omega-0.9.6b.patch.gz
Type: application/x-gzip
Size: 10832 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20060820/7f22ae94/attachment.bin>

Olly Betts

2006-Sep-02 03:57 UTC

[Xapian-devel] omindex patch

On Sun, Aug 20, 2006 at 08:33:56PM +0200, Reini Urban wrote:
> The patch is not yet complete.
I've had a quick look through - I think there are some useful things in
here.  Let me know when you've finished tweaking it.
> It needs autoreconf to update configure and the Makefiles.
> Note that unrar is not patent infected, only rar, the compressor.
> I've put some AC_PATH_PROG checks into configure for all helpers.
This assumes that the filters installed at configure time are the same
as those installed at run time, which isn't necessarily the case (for
binary packaged versions, it's probably rarely true).

I'd prefer to just run the filter anyway and check if it fails.  I've
just added some code to remove the ext->mime-type mapping when the
filter fails because it couldn't be found, so we now effectively lazily
probe the filters we want to use at run-time.
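
A minimal sketch of the run-time probing described above, not the code actually committed to omindex. It assumes a POSIX system where filters are run through the shell, which reports exit status 127 when a command cannot be found. The names `MimeMap`, `run_filter` and `filter_or_forget` are hypothetical and chosen only for illustration.

```cpp
#include <cstdlib>
#include <map>
#include <string>
#include <sys/wait.h>  // WIFEXITED, WEXITSTATUS (POSIX)

// Hypothetical table mapping file extension -> MIME type.
typedef std::map<std::string, std::string> MimeMap;

// Run a filter command through the shell. Returns true if it exited with
// status 0. Sets not_found when the shell reported 127, the conventional
// POSIX shell status for "command not found".
bool run_filter(const std::string& cmd, bool& not_found) {
    not_found = false;
    int status = std::system(cmd.c_str());
    if (status == -1) return false;       // the shell could not be started
    if (!WIFEXITED(status)) return false; // killed by a signal
    int code = WEXITSTATUS(status);
    if (code == 127) not_found = true;
    return code == 0;
}

// Try the filter for one file. If the helper program is missing, erase the
// extension's mapping so later files of that type are skipped without
// running the missing filter again.
bool filter_or_forget(MimeMap& mime_map, const std::string& ext,
                      const std::string& cmd) {
    bool not_found;
    if (run_filter(cmd, not_found)) return true;
    if (not_found) mime_map.erase(ext);
    return false;
}
```

With this shape no configure-time check is needed: the first file that needs a missing helper fails, and the mapping for that extension disappears for the rest of the run. One caveat of the sketch is that a filter which itself exits with status 127 is indistinguishable from a missing one.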
> +AM_LDFLAGS = -no-undefined
Sadly adding this unconditionally causes problems on some platforms (I
forget which off the top of my head).  Do you need it for cygwin?
> +#define SAMPLE_WORDS  500
This is actually the number of *CHARACTERS*, not words.
> +#ifdef HAVE_TEXTCAT
> +    char * lang;
> +    lang = textcat_Classify( textcat, sample.c_str(), sample.length()+1 );
> +    language = string(lang);
> +    if ((language != _TEXTCAT_RESULT_UNKOWN) // unknown language
> +	&& (language != _TEXTCAT_RESULT_SHORT)) // too little information
> +    {
> +	if (language[0] == '[') {
> +	    int pos = language.find(']',0);
> +	    language = language.substr(1,pos-1);
> +	}
> +	record += "\nlanguage=" + language;
> +	if (language != curr_lang)  {
> +	    cout << "new language " << curr_lang << " => " << language << " ";
> +	    stemmer = Xapian::Stem(language);
> +	    curr_lang = language;
> +	}
> +    }
> +#endif
If each document is stemmed in a potentially different language, how do
you decide which stemmer to use at query time?

Also, should documents which are categorised as "unknown" or "too short
to determine" really just get the last used language?  I can see that's
sometimes a good choice, but in other cases it can be very arbitrary.
It also means that such documents can get an entirely different language
in an update (because the previously processed document could be a
completely different one if only a few documents have changed).

Cheers,
    Olly
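
One way to avoid the order dependence raised above is to give undetermined documents a fixed fallback language instead of whatever the previous document used. The following is a sketch under stated assumptions, not part of the patch: `choose_language` is a hypothetical helper, and the sentinel strings "UNKNOWN" and "SHORT" are assumed values for libtextcat's `_TEXTCAT_RESULT_UNKOWN` and `_TEXTCAT_RESULT_SHORT` macros. It also guards the case where the classifier result has an opening '[' but no closing ']', which the quoted hunk does not check.

```cpp
#include <string>

// Assumed literal values of libtextcat's sentinel results; in real code the
// macros from textcat.h would be used instead.
static const std::string RESULT_UNKNOWN = "UNKNOWN";
static const std::string RESULT_SHORT = "SHORT";

// Pick a language name from a classifier result such as "[english][german]".
// Returns the first bracketed name, or `fallback` when the result is empty,
// a sentinel, or malformed. The outcome depends only on this document's
// result, never on which document was processed before it.
std::string choose_language(const std::string& result,
                            const std::string& fallback) {
    if (result.empty() || result == RESULT_UNKNOWN || result == RESULT_SHORT)
        return fallback;
    if (result[0] != '[') return result;  // already a bare name
    std::string::size_type end = result.find(']', 1);
    if (end == std::string::npos || end == 1) return fallback;
    return result.substr(1, end - 1);
}
```

A caller would pass the chosen name on to the stemmer, with the fallback set to a configured default (or a name meaning "no stemming"), so re-indexing a subset of documents gives the same language for each document every time.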
